Abstract

If a discrete probability distribution in a model being tested for goodness-of-fit is not 
close to uniform, then forming the Pearson $\chi^2$ statistic can involve division by nearly
zero. This often leads to serious trouble in practice — even in the absence of round-off 
errors — as the present article illustrates via numerous examples. Fortunately, with the 
now widespread availability of computers, avoiding all the trouble is simple and easy: 
without the problematic division by nearly zero, the actual values taken by goodness- 
of-fit statistics are not humanly interpretable, but black-box computer programs can 
rapidly calculate their precise significance. 
1 Introduction



A basic task in statistics is to ascertain whether a given set of independent and identically 
distributed (i.i.d.) draws does not come from a given "model," where the model may consist 
of either a single fully specified probability distribution or a parameterized family of prob- 
ability distributions. The present paper concerns the case in which the draws are discrete 
random variables, taking values in a finite or countable set. In accordance with the standard 
terminology, we will refer to the possible values of the discrete random variables as "bins" 
("categories," "cells," and "classes" are common synonyms for "bins"). 

A natural approach to ascertaining whether the i.i.d. draws do not come from the model 
uses a root-mean-square statistic. To construct this statistic, we estimate the probability 
distribution over the bins using the given i.i.d. draws, and then measure the root-mean-square 
difference between this empirical distribution and the model distribution; see, for example, 
[17], page 123 of [21], or Section 2 below. If the draws do in fact arise from the model,
then with high probability this root-mean-square is not large. Thus, if the root-mean-square 
statistic is large, then we can be confident that the draws do not arise from the model. 

To quantify "large" and "confident," let us denote by $x$ the value of the root-mean-square
for the given i.i.d. draws; let us denote by $X$ the root-mean-square statistic constructed
for different i.i.d. draws that definitely do in fact come from the model (if the model is
parameterized, then we draw from the distribution corresponding to the parameter given
by a maximum-likelihood estimate for the experimental data). The significance level $\alpha$ is
then defined to be the probability that $X \ge x$ (viewing $X$, but not $x$, as a random
variable). The confidence level that the given i.i.d. draws do not arise from the model is the
complement of the significance level, namely $1 - \alpha$. (See Remark 1.2 concerning our use of
the term "significance level" as synonymous with the alternative term "p-value.")

Now, the significance levels for the simple root-mean-square statistic can be different 
functions of x for different model probability distributions. To avoid this seeming incon- 
venience asymptotically (in the limit of large numbers of draws), K. Pearson replaced the 
uniformly weighted mean in the root-mean-square with a weighted average; the weights are 
the reciprocals of the model probabilities associated with the various bins. This produces 
the classic $\chi^2$ statistic; see, for example, [14] or formula (2) below. However, when model
probabilities can be small (relative to others in the same distribution), this weighted average 
can involve division by nearly zero. As demonstrated below, dividing by nearly zero severely 
restricts the statistical power of $\chi^2$, even in the absence of round-off errors, especially
when dividing by nearly zero for each of many bins. Moreover, this problem arises whether 
or not every bin contains several draws (see Remark 1.1).

The main thesis of the present article is that using only the classic $\chi^2$ statistic is no
longer appropriate, and that certain alternatives are far superior now that computers are widely
available. We demonstrate below that the simple root-mean-square, used in conjunction with
the log-likelihood-ratio "$G^2$" goodness-of-fit statistic, is generally preferable to the classic
$\chi^2$ statistic. (The log-likelihood-ratio also involves division by nearly zero, but tempers
this somewhat by taking a logarithm.) We do not make any claim that this is the best
possible alternative. In fact, the discrete Kolmogorov-Smirnov statistic (or one of its variants,
such as the discrete Kuiper statistic; see, for example, [3] or [5]) can be more powerful
than the root-mean-square in certain circumstances; in any case, the discrete Kolmogorov-Smirnov
statistic and the root-mean-square are similar in many ways, and complementary
in others. We focus on the root-mean-square largely because it is so simple and easy to
understand; for example, computing the confidence levels of the root-mean-square in the
limit of large numbers of draws is trivial, even when estimating continuous parameters via
maximum-likelihood methods (see [15] and [16]). Furthermore, the classic $\chi^2$ statistic is just
a weighted version of the root-mean-square, facilitating their comparison. Finally, $\chi^2$ and
the root-mean-square coincide (up to a constant scaling factor) when the model distribution is uniform.

Please note that all statistical tests reported in the present paper (including those involving
the $\chi^2$ statistic) are exact; we compute significance levels via Monte-Carlo simulations
providing guaranteed error bounds (see Section 3 below). In all numerical results reported
below, we generated random numbers via the C programming language procedure given on 
page 9 of [13], implementing the recommended complementary multiply with carry. 

To be sure, the problem with $\chi^2$ is neither subtle nor esoteric. For a particularly revealing
example, see Subsection 4.5 below.

Appropriate rebinning to uniformize the probabilities associated with the bins can mitigate
much of the problem with $\chi^2$. Yet rebinning is a black art that is liable to improperly
influence the result of a goodness-of-fit test. Moreover, rebinning requires careful extra work,
making $\chi^2$ less easy to use. A principal advantage of the root-mean-square is that it does not
require any rebinning; indeed, the root-mean-square is most powerful without any rebinning.

Remark 1.1. In many of our examples, there is a bin for which the expected number of
draws is very small under the model. Please note that, although it is natural for the expected
numbers of draws for some bins to be very small, especially when the model has many bins,
the advantage of the root-mean-square over $\chi^2$ is substantial even when the expected number
of draws is at least five for every bin; see, for example, Subsection 5.1.1 or Subsection 5.2.6.

Remark 1.2. Please beware that we treat "significance level" as synonymous with the 
alternative term "p-value." These two terms are not exactly the same in the classical ter- 
minology. However, the older concept of "significance level" is no longer very relevant, due 
to the proliferation of computer technology; there is no longer much reason to calculate and 
store tables of thresholds for goodness-of-fit statistics at arbitrarily fixed significance levels 
— we can now compute "p-values" on the fly, as needed. The objective of a significance test 
is not really to accept or reject a hypothesis at some arbitrary threshold of significance, but 
instead to provide significance levels that can inform statisticians' further analysis. 

Remark 1.3. Goodness-of-fit tests are probably most useful in practice not for ascertaining 
whether a model is correct or not, but for determining whether the discrepancy between the 
model and experiment is larger than expected random fluctuations. While models outside 
the physical sciences typically are not exactly correct, testing the validity of using a model 
for virtually any purpose requires knowing whether observed discrepancies are due to inac- 
curacies or inadequacies in the models or (on the contrary) could be due to chance arising 
from necessarily finite sample sizes. Thus, goodness-of-fit tests are critical even when the 
models are not supposed to be exactly correct, in order to gauge the size of the unavoidable 
random fluctuations. For further clarification, see [10] and the remarkably extensive title of
the original article [14] that introduced the $\chi^2$ test for goodness-of-fit.
Remark 1.4. Combining the root-mean-square methodology and the statistical bootstrap
should produce a test for whether two separate sets of draws arise from
the same or from different distributions, when each set is taken i.i.d. from some (unspecified)
distribution; the two distributions associated with the sets may differ. This is related to
testing for association/independence in contingency-tables/cross-tabulations that have only
two rows.

2 Definitions of the test statistics 

In this section, we review the definitions of four goodness-of-fit statistics: the root-mean-square,
$\chi^2$, the log-likelihood-ratio or $G^2$, and the Freeman-Tukey or Hellinger distance.
The latter three statistics are the best-known members of the standard Cressie-Read
power-divergence family. We use $p_1$, $p_2$, ..., $p_{n-1}$, $p_n$ to denote the expected
fractions of $m$ i.i.d. draws falling in $n$ bins, numbered 1, 2, ..., $n-1$, $n$, respectively, and we
use $q_1$, $q_2$, ..., $q_{n-1}$, $q_n$ to denote the observed fractions of the $m$ draws falling in the respective
bins. That is, $p_1$, $p_2$, ..., $p_{n-1}$, $p_n$ are the probabilities associated with the respective bins in
the model distribution, whereas $q_1$, $q_2$, ..., $q_{n-1}$, $q_n$ are the fractions of the $m$ draws falling
in the respective bins when we take the draws from a distribution that may differ from the
model (their actual distribution). Specifically, if $i_1$, $i_2$, ..., $i_{m-1}$, $i_m$ are the observed i.i.d.
draws, then $q_k$ is $1/m$ times the number of $i_1$, $i_2$, ..., $i_{m-1}$, $i_m$ falling in bin $k$, for $k$ = 1, 2, ...,
$n-1$, $n$. If the model is parameterized by a parameter $\theta$, then the probabilities $p_1$, $p_2$, ...,
$p_{n-1}$, $p_n$ are functions of $\theta$; if the model is fully specified, then we can view the probabilities
$p_1$, $p_2$, ..., $p_{n-1}$, $p_n$ as constant as functions of $\theta$. We use $\hat\theta$ to denote a maximum-likelihood
estimate of $\theta$ obtained from $q_1$, $q_2$, ..., $q_{n-1}$, $q_n$.

With this notation, the root-mean-square statistic is

$$X = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \left(q_k - p_k(\hat\theta)\right)^2}. \tag{1}$$

We use the designation "root-mean-square" to refer to $X$.
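The classical Pearson $\chi^2$ statistic is

$$\chi^2 = m \sum_{k=1}^{n} \frac{\left(q_k - p_k(\hat\theta)\right)^2}{p_k(\hat\theta)}, \tag{2}$$

under the convention that $(q_k - p_k(\hat\theta))^2 / p_k(\hat\theta) = 0$ if $p_k(\hat\theta) = 0 = q_k$. We use the standard
designation "$\chi^2$" to refer to this statistic.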

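The log-likelihood-ratio or "$G^2$" statistic is

$$G^2 = 2m \sum_{k=1}^{n} q_k \ln\left(\frac{q_k}{p_k(\hat\theta)}\right), \tag{3}$$

under the convention that $q_k \ln(q_k / p_k(\hat\theta)) = 0$ if $q_k = 0$. We use the common designation
"log-likelihood-ratio" to refer to $G^2$.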
The Freeman-Tukey or Hellinger-distance statistic is

$$H^2 = 4m \sum_{k=1}^{n} \left(\sqrt{q_k} - \sqrt{p_k(\hat\theta)}\right)^2 = 4m \sum_{k=1}^{n} \frac{\left(q_k - p_k(\hat\theta)\right)^2}{\left(\sqrt{q_k} + \sqrt{p_k(\hat\theta)}\right)^2}. \tag{4}$$

We use the well-known designation "Freeman-Tukey" to refer to $H^2$.

In the limit that the number $m$ of draws is large, the distributions of $\chi^2$ defined in (2),
$G^2$ defined in (3), and $H^2$ defined in (4) are all the same when the actual underlying distribution
of the draws comes from the model (see, for example, [12]). However, when the number
m of draws is not large, then their distributions can differ substantially. In all our data and 
power analyses, we compute confidence levels via Monte-Carlo simulations, without relying 
on the number m of draws to be large. 
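The four statistics of this section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `q` and `p` are equal-length lists holding the observed fractions and the model probabilities (already evaluated at the fitted parameter), `m` is the number of draws, the root-mean-square is taken to be the square root of the mean over the bins of the squared differences, and the function names are ours rather than the paper's. Returning infinity when a bin of zero model probability contains draws is an added assumption, since the stated conventions cover only the case in which both quantities vanish.

```python
import math

def rms(q, p):
    """Root-mean-square difference between fractions q and probabilities p."""
    return math.sqrt(sum((qk - pk) ** 2 for qk, pk in zip(q, p)) / len(p))

def chi2(q, p, m):
    """Pearson chi-square; a term counts as 0 when p_k = 0 = q_k."""
    total = 0.0
    for qk, pk in zip(q, p):
        if pk == 0.0:
            if qk > 0.0:
                return math.inf  # assumption: draws landed in an impossible bin
            continue
        total += (qk - pk) ** 2 / pk
    return m * total

def g2(q, p, m):
    """Log-likelihood-ratio; a term counts as 0 when q_k = 0."""
    total = 0.0
    for qk, pk in zip(q, p):
        if qk > 0.0:
            if pk == 0.0:
                return math.inf  # same assumption as in chi2
            total += qk * math.log(qk / pk)
    return 2 * m * total

def freeman_tukey(q, p, m):
    """Freeman-Tukey (Hellinger-distance) statistic."""
    return 4 * m * sum((math.sqrt(qk) - math.sqrt(pk)) ** 2
                       for qk, pk in zip(q, p))

# Usage: a uniform model over n = 4 bins with m = 8 draws.
p = [0.25] * 4
q = [0.5, 0.25, 0.25, 0.0]
m = 8
print(rms(q, p), chi2(q, p, m), g2(q, p, m), freeman_tukey(q, p, m))
```

For a uniform model such as the one in the usage lines, `chi2(q, p, m)` equals `m * n**2 * rms(q, p)**2`, so the two statistics order all data sets identically; the division by the model probabilities only changes the ranking when those probabilities differ across bins.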



3 Hypothesis tests with parameter estimation 

In this section, we discuss the testing of hypotheses involving parameterized models: Given
a family $p(\theta)$ of probability distributions parameterized by $\theta$, and given observed i.i.d. draws
from some actual underlying (unknown) distribution $p$, we would like to test the hypothesis

$$H_0' : \text{for some } \theta,\ p = p(\theta), \tag{5}$$

against the alternative

$$H_1' : \text{for all } \theta,\ p \ne p(\theta). \tag{6}$$

Given only finitely many draws, the significance level for such a test would have to be independent
of the parameter $\theta$, since the proper value for $\theta$ is unknown ($\theta$ is known as a nuisance
parameter). Unfortunately, it is not clear how to devise such a test when the probability
distributions are discrete. None of the standard methods (including $\chi^2$, the log-likelihood-ratio,
the Freeman-Tukey/Hellinger distance, and other power-divergence statistics) produce
significance levels that are independent of the parameter $\theta$. Some methods do produce significance
levels that are independent of $\theta$ in the limit of large numbers of draws, but this is
not especially useful, since in the limit of large numbers of draws any actual parameter $\theta$
would be almost surely known anyway (see Appendix B for further elaboration).
In the present paper, we test the significance of assuming

$$H_0 : p = p(\hat\theta) \text{ for the particular observed value of } \hat\theta, \tag{7}$$

where $\hat\theta$ is a maximum-likelihood estimate of $\theta$; that is, $H_0$ is the hypothesis that $p = p(\hat\theta)$
for the value of $\hat\theta$ associated with the single realization of the experiment that was measured
(subsequent repetitions of the experiment, including those considered when calculating the
significance level as in Remark 3.3, can yield different estimates of the parameter, even
though the repetitions' actual distribution $p$ is the same). Of course, the accuracy of the
estimate $\hat\theta$ generally improves as the number of draws increases; in fact, testing (5) and
testing (7) are asymptotically equivalent, in the limit of large numbers of draws (see [16]).

As testing the hypothesis $H_0'$ defined in (5) does not seem to be feasible in general when
the probability distributions are discrete and there are more than just a few bins, we focus
on testing the closely related assumption $H_0$ defined in (7). The latter is more relevant
for many applications anyway: plots typically display the particular fitted distribution
in (7); interpreting such plots naturally involves (7). All tests of the present paper concern the
significance of assuming $H_0$ defined in (7) (if the model is fully specified, then the probability
distribution $p(\theta)$ is the same for all $\theta$). Please be sure to bear in mind Remark 1.3 of Section 1.

Remark 3.1. Another means of handling nuisance parameters is to test the hypothesis

$$H_0'' : p = p(\hat\theta) \text{ for all possible realizations of the experiment;} \tag{8}$$

that is, $H_0''$ is the hypothesis that $p = p(\hat\theta)$ and that $p(\hat\theta)$ always takes exactly the same
value during repetitions of the experiment. The assumption that (8) is true seems to be
more extreme, a more substantial departure from (5), than (7). Nevertheless, testing (8) is
standard; see, for example, Section 6 of [4]. Assuming (8) amounts to conditioning (5) on a
statistic that is minimally sufficient for estimating $\theta$; computing the associated significance
levels is not always trivial. Testing the significance of assuming (7) would seem to be more
apropos in practice for applications in which the experimental design does not enforce that
repeated experiments always yield the same value for $p(\hat\theta)$.

Remark 3.2. The parameter $\theta$ can be integer-valued, real-valued, complex-valued, vector-valued,
matrix-valued, or any combination of the many possibilities. For instance, when we
do not know the proper ordering of the bins a priori, we must include a parameter that
contains a permutation (or permutation matrix) specifying the order of the bins; maximum-likelihood
estimation then entails sorting the model and all empirical frequencies (whether
experimental or simulated); see Subsection 4.2 for details. With Remark 3.3, we need not
contemplate how many degrees of freedom are in a permutation.

Remark 3.3. To compute the level of significance of assuming (7), we can use Monte-Carlo
simulations (very similar to those in [3]). First, we estimate the parameter $\theta$ from the $m$ given
experimental draws, obtaining $\hat\theta$, and then calculate the statistic under consideration ($\chi^2$,
$G^2$, Freeman-Tukey, or the root-mean-square), using the given data and taking the model
distribution to be $p(\hat\theta)$. We then run many simulations. To conduct a single simulation, we
perform the following three-step procedure:

1. we generate $m$ i.i.d. draws according to the model distribution $p(\hat\theta)$, where $\hat\theta$ is the
estimate calculated from the experimental data,

2. we estimate the parameter $\theta$ from the data generated in Step 1, obtaining a new
estimate $\hat\Theta$, and

3. we calculate the statistic under consideration ($\chi^2$, $G^2$, Freeman-Tukey, or the root-mean-square),
using the data generated in Step 1 and taking the model distribution to
be $p(\hat\Theta)$, where $\hat\Theta$ is the estimate calculated in Step 2 from the data generated in Step 1.

After conducting many such simulations, we may estimate the confidence level for rejecting
(7) as the fraction of the statistics calculated in Step 3 that are less than the statistic
calculated from the empirical data. (Recall that a significance level of $\alpha$ is the same as a
confidence level of $1 - \alpha$.) The error of the estimated confidence level is inversely proportional
to the square root of the number of simulations conducted; for details, see Remark 3.4
below. This procedure works since, by definition, the confidence level is the probability that

$$d(Q, p(\hat\Theta)) < d(q, p(\hat\theta)), \tag{9}$$

where

• $n$ is the number of all possible values that the draws can take,

• $d$ is the measure of the discrepancy between two probability distributions over $n$ bins
(i.e., between two vectors each with $n$ entries) that is associated with the statistic
under consideration ($d$ is the Euclidean distance for the root-mean-square, a weighted
Euclidean distance for $\chi^2$, the Hellinger distance for the Freeman-Tukey statistic, and
the relative entropy, that is, the Kullback-Leibler divergence, for the log-likelihood-ratio),

• $q_1$, $q_2$, ..., $q_{n-1}$, $q_n$ are the fractions of the $m$ given experimental draws falling in the
respective bins,

• $\hat\theta$ is the estimate of $\theta$ obtained from $q_1$, $q_2$, ..., $q_{n-1}$, $q_n$,

• $Q_1$, $Q_2$, ..., $Q_{n-1}$, $Q_n$ are the fractions of $m$ i.i.d. draws falling in the respective bins
when taking the draws from the distribution $p(\hat\theta)$ assumed in (7), and

• $\hat\Theta$ is the estimate of the parameter $\theta$ obtained from $Q_1$, $Q_2$, ..., $Q_{n-1}$, $Q_n$ (note that
$\hat\Theta$ is not necessarily always equal to $\hat\theta$: even under the null hypothesis, repetitions of
the experiment could yield different estimates of the parameter; see also Remark B.2).

When taking the probability that (9) occurs, only the left-hand side is random: we regard
the left-hand side of (9) as a random variable and the right-hand side as a fixed number
determined via the experimental data. As with any probability, to compute the probability
that (9) occurs, we can calculate many independent realizations of the random variable and
observe that the fraction which satisfy (9) is a good approximation to the probability when
the number of realizations is large; Remark 3.4 details the accuracy of the approximation.
(The procedure in the present remark follows this prescription to estimate confidence levels.)
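The three-step procedure of this remark can be sketched as follows. This is a minimal illustration, not the implementation used for the reported results: the callables `model`, `fit`, and `statistic` are hypothetical placeholders supplied by the user, and the sketch draws random numbers with Python's standard `random` module rather than with the complementary-multiply-with-carry generator mentioned in the introduction. The function returns the estimated significance level, that is, the fraction of simulated statistics greater than or equal to the observed one.

```python
import math
import random

def monte_carlo_significance(counts, model, fit, statistic, num_sims, rng):
    """Estimate the significance level by simulation.

    counts    -- observed numbers of draws in the n bins
    model     -- hypothetical callable: parameter -> list of n probabilities
    fit       -- hypothetical callable: fractions -> parameter estimate
    statistic -- hypothetical callable: (q, p, m) -> discrepancy value
    """
    m, n = sum(counts), len(counts)
    q = [c / m for c in counts]
    p_hat = model(fit(q))              # model fitted to the experimental data
    observed = statistic(q, p_hat, m)
    at_least = 0
    for _ in range(num_sims):
        # Step 1: m i.i.d. draws from the fitted model distribution.
        sim = [0] * n
        for k in rng.choices(range(n), weights=p_hat, k=m):
            sim[k] += 1
        Q = [c / m for c in sim]
        # Step 2: re-estimate the parameter from the simulated draws.
        theta_sim = fit(Q)
        # Step 3: statistic of the simulated draws against the refitted model.
        if statistic(Q, model(theta_sim), m) >= observed:
            at_least += 1
    return at_least / num_sims

def rms(q, p, m=None):
    """Root-mean-square statistic; m is unused but kept for a common signature."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, p)) / len(p))

# Usage with a fully specified model, so that fitting is trivial.
p = [0.25, 0.25] + [1 / 16] * 8
rng = random.Random(0)
alpha = monte_carlo_significance([15, 5] + [0] * 8,
                                 lambda theta: p, lambda q: None,
                                 rms, 2000, rng)
print(alpha)
```

Passing the fitting step in as a callable keeps Step 2 explicit: for a parameterized model the same function is applied to the experimental data and to every simulated data set, which is what makes the resulting significance level account for the estimation.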

Remark 3.4. The standard error of the estimate from Remark 3.3 for an exact significance
level of $\alpha$ is $\sqrt{\alpha(1-\alpha)/\ell}$, where $\ell$ is the number of Monte-Carlo simulations conducted to
produce the estimate. Indeed, each simulation has probability $\alpha$ of producing a statistic that
is greater than or equal to the statistic corresponding to an exact significance level of $\alpha$. Since
the simulations are all independent, the number of the $\ell$ simulations that produce statistics
greater than or equal to that corresponding to level $\alpha$ follows the binomial distribution with
$\ell$ trials and probability $\alpha$ of success in each trial. The standard deviation of the number
of simulations whose statistics are greater than or equal to that corresponding to level $\alpha$
is therefore $\sqrt{\ell\alpha(1-\alpha)}$, and so the standard deviation of the fraction of the simulations
producing such statistics is $\sqrt{\alpha(1-\alpha)/\ell}$. Of course, the fraction itself is the Monte-Carlo
estimate of the exact significance level (we use this estimate in place of the unknown $\alpha$ when
calculating the standard error $\sqrt{\alpha(1-\alpha)/\ell}$).
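As a small worked illustration of this formula, the following sketch evaluates the standard error for a given estimate and number of simulations; the function name is ours. The standard error is largest when the significance level is 1/2, which gives a simple worst-case bound for any fixed number of simulations.

```python
import math

def mc_standard_error(alpha_estimate, num_sims):
    """Binomial standard error of a Monte-Carlo significance-level estimate."""
    return math.sqrt(alpha_estimate * (1.0 - alpha_estimate) / num_sims)

# Worst case (alpha = 1/2) for 40,000 and for 4,000,000 simulations.
print(mc_standard_error(0.5, 40_000))
print(mc_standard_error(0.5, 4_000_000))
```

With 40,000 simulations the worst-case standard error is 0.0025, and with 4,000,000 simulations it is 0.00025.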
4 Data analysis



In this section, we use several data sets to investigate the performance of goodness-of-fit 
statistics. The root-mean-square generally performs much better than the classical statistics. 
We take the position that a user of statistics should not have to worry about rebinning; we 
discuss rebinning only briefly. We compute all significance levels via Monte Carlo as in
Remark 3.3; Remark 3.4 details the guaranteed accuracy of the computed significance levels.

4.1 Synthetic examples 

To better explicate the performance of the goodness-of-fit statistics, we first analyze some 
toy examples. We consider the model distribution 

$$p_1 = \frac{1}{4}, \tag{10}$$

$$p_2 = \frac{1}{4}, \tag{11}$$

and

$$p_k = \frac{1}{2n - 4} \tag{12}$$

for $k = 3, 4, \ldots, n-1, n$. For the empirical distribution, we first use $m = 20$ draws, with
15 in the first bin, 5 in the second bin, and no draw in any other bin. This data is clearly
unlikely to arise from the model specified in (10)-(12), but we would like to see exactly how
well the various goodness-of-fit statistics detect the obvious discrepancy.

Figure 1 plots the significance levels for testing whether the empirical data arises from the
model specified in (10)-(12). We computed the significance levels via 4,000,000 Monte-Carlo
simulations (that is, 4,000,000 per empirical significance level being evaluated), with each
simulation taking $m = 20$ draws from the model. The root-mean-square consistently and
with extremely high confidence rejects the hypothesis that the data arises from the model,
whereas the classical statistics find less and less evidence for rejecting the hypothesis as the
number $n$ of bins increases; in fact, the significance levels for the classical statistics get very
close to 1 as $n$ increases: the discrepancy of the model (10)-(12) from the given data is usually
less than the discrepancy of (10)-(12) from a typical realization drawn from the model, since
under the model the sum of the expected numbers of draws in bins 3, 4, ..., $n-1$, $n$ is $m/2$.
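The mechanism can be checked by direct arithmetic. The sketch below assumes that the model assigns probability 1/4 to each of the first two bins and 1/(2n - 4) to each of the remaining n - 2 bins, so that the unlikely bins carry total probability 1/2; the function names are ours. It compares the observed Pearson statistic for the data of 15 and 5 draws with the mean of that statistic under a fully specified model with n bins of positive probability, which is exactly n - 1.

```python
def synthetic_model(n):
    """Assumed model: 1/4, 1/4, then n - 2 equally unlikely bins summing to 1/2."""
    return [0.25, 0.25] + [1.0 / (2 * n - 4)] * (n - 2)

def chi2(q, p, m):
    """Pearson statistic for fractions q, strictly positive probabilities p."""
    return m * sum((qk - pk) ** 2 / pk for qk, pk in zip(q, p))

m = 20
for n in (10, 100, 1000):
    p = synthetic_model(n)
    q = [15 / m, 5 / m] + [0.0] * (n - 2)
    observed = chi2(q, p, m)
    mean_under_model = n - 1  # exact mean of Pearson's statistic under the model
    print(n, round(observed, 6), mean_under_model)
```

Under these assumptions the observed statistic is 30 for every n, because the empty bins contribute m/2 in total no matter how finely they are split, whereas the mean under the model grows as n - 1. Once n is much larger than 31, the observed value falls below what the model itself typically produces, which is why the classical significance levels approach 1.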

Figure 1 demonstrates that the root-mean-square can be much more powerful than the
classical statistics, rejecting with nearly 100% confidence while the classical statistics report
nearly 0% confidence for rejection. Moreover, the classical statistics can report significance
levels very close to 1 even when the data manifestly does not arise from the model. (Incidentally,
the model for smaller $n$ can be viewed as a rebinning of the model for larger $n$. The
classical statistics do reject the model for smaller $n$, while asserting for larger $n$ that there
is no evidence for rejecting the model.) The performance of the classical statistics displays
a dramatic dependence on the number $(n - 2)$ of unlikely bins in the model, even though
the data are the same for all $n$. This suggests a sure-fire scheme for supporting any model
(no matter how invalid) with arbitrarily high significance: just append enough irrelevant,
more or less uniformly improbable bins to the model, and then report the significance levels
for the classical goodness-of-fit statistics. In contrast, the root-mean-square robustly and
reliably rejects the invalid model, independently of the size of the model.

Figure 1: Significance levels for the hypothesis that the model (10)-(12) agrees with the data
of 15 draws in the first bin, 5 draws in the second bin, and no draw in any other bin

We will see in the following section that the classic Zipf power law behaves similarly. 

For another example, we again consider the model specified in (10)-(12). For the empirical
distribution, we now use $m = 96$ draws, with 36 in the first bin, 12 in the second bin,
1 each for bins 3, 4, ..., 49, 50, and no draw in any other bin. As before, this data clearly
is unlikely to arise from the model specified in (10)-(12), but we would like to see exactly
how well the various goodness-of-fit statistics detect the obvious discrepancy.

Figure 2 plots the significance levels for testing whether the empirical data arises from
the model specified in (10)-(12). We computed the significance levels via 160,000 Monte-Carlo
simulations (that is, 160,000 per empirical significance level being evaluated), with
each simulation taking $m = 96$ draws from the model. Yet again, the root-mean-square
consistently and confidently rejects the hypothesis that the data arises from the model,
whereas the classical statistics find little evidence for rejecting the manifestly invalid model.



4.2 Zipf's power law of word frequencies

Zipf popularized his eponymous law by analyzing four "chief sources of statistical data re- 
ferred to in the main text [23]" (this is a quotation from the "Notes and References" section 
— page 311 — of [23]); in [23], the chief source for the English language is [7]. We revisit 
the data from [7] in the present subsection to assess the performance of the goodness-of-fit 
statistics. 

Figure 2: Significance levels for the hypothesis that the model (10)-(12) agrees with the
data of 36 draws in the first bin, 12 draws in the second bin, 1 draw each in bins 3, 4, ...,
49, 50, and no draw in any other bin

We first analyze List 1 of [7], which consists of 2,890 different English words, such that
there are 13,825 words in total counting repetitions; the words come from the Buffalo Sunday
News of August 8, 1909. We randomly choose $m$ = 10,000 of the 13,825 words to obtain a
corpus of $m$ = 10,000 draws over 2,890 bins. Figure 3 plots the frequencies of the different
words when sorted in rank order (so that the frequencies are nonincreasing). Using goodness-of-fit
statistics we test the significance of the (null) hypothesis that the empirical draws
actually arise from the Zipf distribution

$$p_k(\theta) = \frac{C_1}{\theta(k)} \tag{13}$$

for $k = 1, 2, \ldots, n-1, n$, where $\theta$ is a permutation of the integers $1, 2, \ldots, n-1, n$, and

$$C_1 = \frac{1}{\sum_{k=1}^{n} 1/k}; \tag{14}$$

we estimate the permutation $\theta$ via maximum-likelihood methods, that is, by sorting the
frequencies: first we choose $k_1$ to be the number of a bin containing the greatest number of
draws among all $n$ bins, then we choose $k_2$ to be the number of a bin containing the greatest
number of draws among the remaining $n - 1$ bins, then we choose $k_3$ to be the number of a
bin containing the greatest number of draws among the remaining $n - 2$ bins, and so on, and finally we find
$\hat\theta$ such that $\hat\theta(k_1) = 1$, $\hat\theta(k_2) = 2$, ..., $\hat\theta(k_{n-1}) = n - 1$, $\hat\theta(k_n) = n$. We have to obtain the
ordering $\hat\theta$ from the data via such sorting since we do not know the proper ordering a priori.
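The sorting step can be sketched as follows, assuming the classic Zipf distribution in which the bin of rank r has probability proportional to 1/r, normalized over n bins; the function names are ours. The permutation is represented as a list `theta` with `theta[k]` equal to the 1-based rank assigned to bin `k`. Ties between bins with equal counts are broken by bin index here, which is an arbitrary choice that does not affect the likelihood.

```python
def zipf_probabilities(n):
    """Probabilities proportional to 1/rank for ranks 1, ..., n."""
    c = 1.0 / sum(1.0 / r for r in range(1, n + 1))
    return [c / r for r in range(1, n + 1)]

def fit_permutation(counts):
    """Rank the bins by decreasing count; theta[k] is the rank of bin k."""
    order = sorted(range(len(counts)), key=lambda k: -counts[k])
    theta = [0] * len(counts)
    for rank, k in enumerate(order, start=1):
        theta[k] = rank
    return theta

def zipf_model(theta):
    """Model probability of each bin under the permutation theta."""
    by_rank = zipf_probabilities(len(theta))
    return [by_rank[rank - 1] for rank in theta]

# Usage: four bins with counts 2, 7, 0, 4.
theta = fit_permutation([2, 7, 0, 4])
model = zipf_model(theta)
print(theta)
print(model)
```

In the usage lines, the bin with 7 draws receives rank 1 and therefore the largest model probability, while the empty bin receives rank 4. In a Monte-Carlo simulation the same sorting must be applied to every simulated data set, as Step 2 of Remark 3.3 requires.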

Similarly, we do not know the proper value of the number $n$ of bins, so in Figure 4 we plot
significance levels (each computed via 40,000 Monte-Carlo simulations) for varying values of
$n$; although List 1 of [7] involves only 2,890 distinct words, we must also include bins for
words that did not appear in the original list, words whose frequencies are zeros for List 1
of [7]. Note that Figure 4 displays the significance levels with $n$ = 2,890 for reference, even
though $n$ must be independent of the data, and so $n$ must be substantially larger than 2,890
in order for the assumptions of goodness-of-fit testing to hold.

With respect to testing goodness-of-fit, the number n of bins is the number of words in 
the dictionary from which List 1 of [7] was drawn. It is not clear a priori which dictionary 
is appropriate. Fortunately, the significance levels for the root-mean-square are always 0 to
several digits of accuracy, independent of the value of $n$: the root-mean-square determines
that List 1 does not follow the classic Zipf distribution (defined in (13) and (14)) for any
$n$. In contrast, the significance levels for the classical statistics vary wildly depending on
the value of $n$. In fact, for any of the classical statistics, and for any prescribed number $\alpha$
between 0.05 and 0.95, there is at least one value of $n$ between 4,000 and 40,000 such that
the significance level is $\alpha$. Thus, without knowing the proper size of the dictionary a priori,
the classical statistics are meaningless.

Unsurprisingly, analyzing List 5 of [7] produces results analogous to those reported above
for List 1. List 5 consists of 6,002 different English words, such that there are 43,989 words in
total counting repetitions; the words come from amalgamating Lists 1-4 of [7]. We randomly
choose $m$ = 20,000 of the 43,989 words to obtain a corpus of $m$ = 20,000 draws over 6,002
bins. Figure 5 plots the frequencies of the different words when sorted in rank order (so that
the frequencies are nonincreasing).

Again we do not know the proper value of the number $n$ of bins, so in Figure 6 we plot
significance levels (each computed via 40,000 Monte-Carlo simulations) for varying values of
$n$; although List 5 of [7] involves only 6,002 distinct words, we must also include bins for
words that did not appear in the original list, words whose frequencies are zeros for List 5
of [7]. Please note that Figure 6 displays the significance levels with $n$ = 6,002 for reference,
even though $n$ must be independent of the data, and so $n$ must be substantially larger than
6,002 in order for the assumptions of goodness-of-fit testing to hold. Comparing Figures 4
and 6 shows that the above remarks about List 1 pertain to the analysis of the larger List 5,
too. Once again, without knowing the proper size of the dictionary a priori, the classical
statistics are meaningless, whereas the root-mean-square is very powerful.

Interestingly, by introducing parameters $\theta_1$, $\theta_2$, and $\theta_3$ to fit perfectly the bins containing
the three greatest numbers of draws, a truncated power-law becomes a good fit for the corpus
of 20,000 words drawn randomly from List 5 of [7], with the number $n$ of bins set to 7,500.
Indeed, let us consider the model

$$p_k(\theta_0, \theta_1, \theta_2, \theta_3, \theta_4) = \begin{cases} \theta_1, & \theta_0(k) = 1, \\ \theta_2, & \theta_0(k) = 2, \\ \theta_3, & \theta_0(k) = 3, \\ C_{\theta_1,\theta_2,\theta_3,\theta_4} / (\theta_0(k))^{\theta_4}, & \theta_0(k) = 4, 5, \ldots, 7499, 7500, \end{cases} \tag{15}$$

where

$$C_{\theta_1,\theta_2,\theta_3,\theta_4} = \frac{1 - \theta_1 - \theta_2 - \theta_3}{\sum_{k=4}^{7500} k^{-\theta_4}}, \tag{16}$$

with $\theta_0$ being a permutation of the integers 1, 2, ..., 7499, 7500, and $\theta_1$, $\theta_2$, $\theta_3$, $\theta_4$ being
nonnegative real numbers; we estimate $\theta_0$, $\theta_1$, $\theta_2$, $\theta_3$, $\theta_4$ via maximum-likelihood methods,
determining $\theta_0$ by sorting as discussed above, and setting $\theta_1$, $\theta_2$, and $\theta_3$ to be the three
greatest relative frequencies. This model fits the empirical data exactly in the bins whose
probabilities under the model are $\theta_1$, $\theta_2$, and $\theta_3$ (there will be no discrepancy between the
data and the model in those bins), so that these bins do not contribute to any goodness-of-fit
statistic, aside from altering the number of draws in the remaining bins. Of the 20,000
total draws in the given experimental data, 16,486 do not fall in the bins associated with the
three most frequently occurring words. The maximum-likelihood estimate of the power-law
exponent $\theta_4$ for the experimental data turns out to be about 1.0484.

Figure 3: Numbers of occurrences of the various words (one bin for each distinct word) in a
corpus of 10,000 random draws from List 1 of [7]

For the model defined in (15) and (16), the significance levels calculated via 4,000,000
Monte-Carlo simulations are

• $\chi^2$: .510

• $G^2$: .998

• Freeman-Tukey: 1.000

• root-mean-square: .587

Thus, all four statistics indicate that the truncated power-law model defined in (15) and (16)
is a good fit. This is in accord with Figure 5, in which all but the three greatest frequencies
appear to follow a truncated power-law.

Figure 4: Significance levels for the data plotted in Figure 3 to follow the Zipf distribution

Figure 5: Numbers of occurrences of the various words (one bin for each distinct word) in a
corpus of 20,000 random draws from List 5 of [7]

Figure 6: Significance levels for the data plotted in Figure 5 to follow the Zipf distribution



4.3 A Poisson law for radioactive decays 

Table 1 summarizes the classic example of a Poisson-distributed experiment in radioactive
decay from [19]; Figure 7 plots the data, along with the Poisson distribution whose mean is
the same as the data's. Figure 8 reports the significance levels for testing whether the data,
while retaining only bins 1, 2, ..., n - 1, n, are distributed according to a Poisson distribution
(the model Poisson distribution is also truncated to the first n bins, with the mean estimated
from the data). Since the total number m of draws depends little on the numbers in bins
13, 14, 15, ..., the truncation amounts to ignoring draws in bins n + 1, n + 2, n + 3, ...
when n > 12, and demonstrates that the scant experimental draws in bins 13-15 strongly
influence the significance levels of the classical statistics. We computed the significance levels
via 40,000 Monte-Carlo simulations (for each number n of bins and each of the four statistics),
estimating the mean of the model Poisson distribution for each simulated data set. All four
goodness-of-fit statistics indicate reasonably good agreement between the data and a Poisson
distribution; the classical statistics are very sensitive in the tail to discrepancies between the
data and the model distribution, whereas the root-mean-square is relatively insensitive to
the truncation after 12 or more bins.
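The sensitivity to truncation described above can be explored with a short sketch. It is illustrative only and rests on assumptions that are not taken from this section: the statistics use their standard definitions (χ² = m Σ (q_k - p_k)²/p_k, G² = 2m Σ q_k ln(q_k/p_k), Freeman-Tukey = 4m Σ (√q_k - √p_k)², and the root-mean-square of q_k - p_k over the n bins), the blank cells of Table 1 are read as zeros (which reproduces the total of 2608), and the Poisson mean is simply matched to the retained data rather than obtained from an exact truncated maximum-likelihood fit. The function names are hypothetical, and the sketch computes only the values of the statistics, not their significance levels.

```python
import math

# Table 1: numbers of 7.5-second intervals with 0, 1, ..., 14 observed particles
# (blank cells in the extracted table are assumed to be zero; the total is then 2608).
counts = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 0, 1, 1]


def truncated_poisson(mean, n):
    """Poisson probabilities for 0, 1, ..., n - 1 particles, renormalized over the n bins."""
    weights = [math.exp(-mean) * mean**j / math.factorial(j) for j in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def statistics(counts, p):
    """Return (chi-square, G^2, Freeman-Tukey, root-mean-square) for counts against model p."""
    m, n = sum(counts), len(counts)
    q = [c / m for c in counts]  # empirical distribution
    chi2 = m * sum((qk - pk) ** 2 / pk for qk, pk in zip(q, p))
    # Bins with q_k = 0 contribute nothing to G^2 (0 * ln 0 is taken to be 0).
    g2 = 2 * m * sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)
    ft = 4 * m * sum((math.sqrt(qk) - math.sqrt(pk)) ** 2 for qk, pk in zip(q, p))
    rms = math.sqrt(sum((qk - pk) ** 2 for qk, pk in zip(q, p)) / n)
    return chi2, g2, ft, rms


def fit_and_score(counts, n):
    """Keep the first n bins, match the Poisson mean to the kept data, and score the fit."""
    kept = counts[:n]
    mean = sum(j * c for j, c in enumerate(kept)) / sum(kept)
    return mean, statistics(kept, truncated_poisson(mean, n))


for n in (12, 15):
    mean, (chi2, g2, ft, rms) = fit_and_score(counts, n)
    print(f"n={n}: mean={mean:.4f} chi2={chi2:.2f} G2={g2:.2f} FT={ft:.2f} rms={rms:.5f}")
```

Comparing the printed values for n = 12 and n = 15 shows how much the two sparsely populated tail bins move each statistic; turning the values into significance levels would still require the Monte-Carlo simulations described in the text.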

4.4 A Poisson law for counting with a haemacytometer 

Page 357 of [20] reports the number of yeast cells observed in each of 400 squares in a
haemacytometer microscope slide. Table 2 displays the counts; Figure 9 plots them, along
with the Poisson distribution whose mean matches the data's. The significance levels for the
data to arise from a Poisson distribution (with the mean estimated from the data) are

• χ²: .627

• G²: .365

• Freeman-Tukey: .111

• root-mean-square: .490

We calculated the significance levels via 4,000,000 Monte-Carlo simulations, estimating the
mean of the model Poisson distribution for each simulated data set. Evidently, all four
statistics report that a Poisson distribution is a reasonably good model for the experimental
data.

Table 1: Numbers of α-particles emitted by a film of polonium in 2608 intervals of 7.5 seconds

bin number         number of particles observed     number of such intervals
                   in an interval of 7.5 seconds
1                  0                                57
2                  1                                203
3                  2                                383
4                  3                                525
5                  4                                532
6                  5                                408
7                  6                                273
8                  7                                139
9                  8                                45
10                 9                                27
11                 10                               10
12                 11                               4
13                 12                               0
14                 13                               1
15                 14                               1
16, 17, 18, ...    15, 16, 17, ...                  0
1, 2, 3, 4, 5, ... 0, 1, 2, 3, 4, ...               2608

Figure 7: The data in Table 1 (the dots) and the best-fit Poisson distribution (the lines)

Figure 8: Significance levels for the distribution of Table 1 to be Poisson

4.5 A Hardy-Weinberg law for Rhesus blood groups

In a population with suitably random mating, the proportions of pairs of Rhesus haplotypes
in members of the population (each member has one pair) can be expected to follow the
Hardy-Weinberg law (see, for example, [H]), namely to arise via random sampling from the
model

$$p_{j,k}(\theta_1, \theta_2, \dots, \theta_8, \theta_9) = \begin{cases} 2 \theta_j \theta_k, & j > k \\ \theta_k^2, & j = k \end{cases} \qquad (17)$$

for j, k = 1, 2, ..., 8, 9 with j ≥ k, under the constraint that

$$\sum_{k=1}^{9} \theta_k = 1, \qquad (18)$$

where the parameters $\theta_1$, $\theta_2$, ..., $\theta_8$, $\theta_9$ are the proportions of the nine Rhesus haplotypes in
the population (their maximum-likelihood estimates are the proportions of the haplotypes
in the given data). For j, k = 1, 2, ..., 8, 9 with j ≥ k, therefore, $p_{j,k}$ is the expected
probability that the pair of haplotypes in the genome of an individual is the pair j and k.

In this formulation, the hypothesis of suitably random mating entails that the members
of the sample population are i.i.d. draws from the model specified in (17); if a goodness-of-fit
statistic rejects the model with high confidence, then we can be confident that mating has
not been suitably random. Table 3 provides data on m = 8297 individuals; we duplicated
Figure 3 of [11] to obtain Table 3.

Table 2: Numbers of yeast cells in 400 squares of a haemacytometer

bin number         number of yeast in a square     number of such squares
1                  0                               0
2                  1                               20
3                  2                               43
4                  3                               53
5                  4                               86
6                  5                               70
7                  6                               54
8                  7                               37
9                  8                               18
10                 9                               10
11                 10                              5
12                 11                              2
13                 12                              2
14, 15, 16, ...    13, 14, 15, ...                 0
1, 2, 3, 4, 5, ... 0, 1, 2, 3, 4, ...              400

Table 3: Frequencies of pairs of Rhesus haplotypes

j\k       1      2      3      4      5      6      7      8      9
1      1236
2       120      3
3        18      0      0
4       982     55      7    249
5        32      1      0     12      0
6      2582    132     20   1162     29   1312
7         6      0      0      4      0      4      0
8         2      0      0      0      0      0      0      0
9       115      5      2     53      1    149      0      0      4

The significance levels calculated via 4,000,000 Monte-Carlo simulations are

• χ²: .693

• G²: .600

• Freeman-Tukey: .562

• negative log-likelihood (see Remark 4.2 below): .649

• root-mean-square: .039

Unlike the root-mean-square, the classical statistics are blind to the significant discrepancy
between the data and the Hardy-Weinberg model.

Remark 4.1. For the example of the present subsection, rejecting the null hypothesis
from Section 3 might seem in principle to be more interesting than rejecting the assumption
(7). Fortunately, the difference between (5) and (7) is essentially irrelevant for the
root-mean-square in this example. Indeed, the root-mean-square is not very sensitive to bins
associated with the parameters whose estimated values are potentially inaccurate: the
potentially inaccurate estimates are all small, and the root-mean-square is not very sensitive
to bins whose probabilities under the model are small relative to others.

Table 4: Frequencies of antigen genotypes

j\k     1     2     3     4
1       0
2       3     1
3       5    18     1
4       3     7     5     2

Remark 4.2. The term "negative log-likelihood" used in the present section refers to the
statistic that is simply the negative of the logarithm of the likelihood. The negative
log-likelihood is the same statistic used in the generalization of Fisher's exact test discussed
in [11]; unlike G², this statistic involves only one likelihood, not the ratio of two. We mention
the negative log-likelihood just to facilitate comparisons with [11]; we are not asserting that
the likelihood on its own (rather than in a ratio) is a good gauge of the relative sizes of
deviations from a model.

Remark 4.3. Table 4 provides data on m = 45 individuals from the other set of real-world
measurements given in [11]; we duplicated Figure 2 of [11] to obtain Table 4. The
associated Hardy-Weinberg model is then the same as (17), but with only four parameters,
$\theta_1$, $\theta_2$, $\theta_3$, $\theta_4$, such that $\sum_{k=1}^{4} \theta_k = 1$. The significance levels calculated via 4,000,000
Monte-Carlo simulations are

• χ²: .021

• G²: .013

• Freeman-Tukey: .027

• negative log-likelihood (see Remark 4.2 above): .016

• root-mean-square: .0019

Again the root-mean-square is more powerful than the classical statistics (though in this case
all these statistics report significant discrepancies between the data and the Hardy-Weinberg
model).
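As an illustration of the Hardy-Weinberg model for the small data set of Table 4, the following sketch estimates the haplotype proportions by counting (each individual contributes two haplotypes, so a homozygote counts twice), builds the model probabilities of (17), and evaluates the root-mean-square discrepancy. Several points are assumptions of the sketch: the blank cell of Table 4 is read as zero (which reproduces m = 45), the root-mean-square is averaged over the ten bins with j ≥ k, and the function names are hypothetical. It computes only the value of the statistic, not its significance level.

```python
import math
from fractions import Fraction

# Lower-triangular genotype counts from Table 4 (row j, column k, with j >= k);
# the blank cell is assumed to be zero.
counts = [
    [0],
    [3, 1],
    [5, 18, 1],
    [3, 7, 5, 2],
]


def haplotype_proportions(counts):
    """Proportions of the haplotypes in the data (the maximum-likelihood estimates)."""
    m = sum(sum(row) for row in counts)
    tally = [0] * len(counts)
    for j, row in enumerate(counts):
        for k, c in enumerate(row):
            tally[j] += c
            tally[k] += c  # a homozygote (j == k) is therefore counted twice
    return [Fraction(t, 2 * m) for t in tally]


def hardy_weinberg(theta):
    """Model probabilities of (17): theta_k^2 on the diagonal, 2 theta_j theta_k below it."""
    return [
        [theta[j] ** 2 if j == k else 2 * theta[j] * theta[k] for k in range(j + 1)]
        for j in range(len(theta))
    ]


m = sum(sum(row) for row in counts)
theta = haplotype_proportions(counts)
model = hardy_weinberg(theta)
pairs = [(c, p) for crow, prow in zip(counts, model) for c, p in zip(crow, prow)]
rms = math.sqrt(sum((c / m - float(p)) ** 2 for c, p in pairs) / len(pairs))
print([str(t) for t in theta], rms)
```

Exact rational arithmetic makes it easy to confirm that the estimated proportions and the model probabilities each sum to one.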

4.6 Symmetry between the self-reported health assessments of 
foreign- and US-born Asian Americans 

Using propensity scores, [8] matched each of 335 surveyed foreign-born Asian Americans
to a similar surveyed US-born Asian American. Table 5 duplicates Table 4 of [8], which
tabulates the numbers of matched pairs reporting various combinations of self-rated physical
health; the model used for generating the propensity scores did not explicitly incorporate the
health ratings. Table 5 does not reveal any significant difference between foreign-born Asian
Americans' ratings of their health and US-born Asian Americans'. Indeed, the significance
levels calculated via 4,000,000 Monte-Carlo simulations for testing the symmetry of Table 5
are

• χ²: .784

• G²: .739

• Freeman-Tukey: .642

• root-mean-square: .973

After noting that χ² does not reveal any statistically significant asymmetry in Table 5,
[8] reports that, "to address the issue of power of this test, we investigated what is the smallest
departure from symmetry that our test could detect. . . ." Such an investigation requires
considering modifications to Table 5. Table 6 provides one possible modification. The
significance levels calculated via 4,000,000 Monte-Carlo simulations for testing the symmetry
of Table 6 are

• χ²: .109

• G²: .123

• Freeman-Tukey: .155

• root-mean-square: .014

Evidently, the root-mean-square is more powerful for detecting the asymmetry of Table 6.

Table 7 provides another hypothetical cross-tabulation. The significance levels calculated
via 64,000,000 Monte-Carlo simulations for testing the symmetry of Table 7 are

• χ²: .0015

• G²: .00016

• Freeman-Tukey: .000006, i.e., 6E-6

• root-mean-square: .131

The classical statistics are much more powerful for detecting the asymmetry of Table 7,
in contrast to Table 6, whose asymmetry the root-mean-square detects more powerfully.
Indeed, the root-mean-square statistic is not very sensitive to relative discrepancies between
the model and actual distributions in bins whose associated model probabilities are small.
When sensitivity in these bins is desirable, we recommend using both the root-mean-square
statistic and an asymptotically equivalent variation of χ², such as the log-likelihood-ratio
G²; see, for example, [17].

Table 5: Self-reported physical health for matched pairs of Asian Americans

                                      foreign-born
                      excellent   very good   good   fair   poor
          excellent      10          21        22      5      0
          very good      24          53        43     15      3
US-born   good           21          43        34     11      0
          fair            3          11         8      4      1
          poor            1           1         1      0      0

Table 6: A variation on Table 5

                                      foreign-born
                      excellent   very good   good   fair   poor
          excellent      10          21        22      5      0
          very good      24          53        56     15      3
US-born   good           21          30        34     11      0
          fair            3          11         8      4      1
          poor            1           1         1      0      0

Table 7: Another variation on Table 5

                                      foreign-born
                      excellent   very good   good   fair   poor
          excellent      10          21        22      5      0
          very good      24          53        43     15      3
US-born   good           21          43        34     19      0
          fair            3          11         0      4      1
          poor            1           1         1      0      0

4.7 A modified geometric law for the species of butterflies



C. B. Williams, R. A. Fisher, and A. S. Corbet reported in [9] on 5300 butterflies from
217 readily identified species (these exclude the 23 most common readily identified species)
they collected via random sampling at the Rothamsted Experimental Station in England.
Figure 10 plots the numbers of individual butterflies collected from the 217 species when
sorted in rank order (so that the numbers are nonincreasing).

To build a model appropriate for Figure 10, we must include a permutation of the bins
as a parameter, since we have sorted the data (see Subsection 4.2 for further discussion of
sorting and permutations). We take the model to be

$$p_k(\theta_0, \theta_1) = C_{\theta_1} \frac{\theta_1^{\theta_0(k)}}{\sqrt{\theta_0(k) + 23}} \qquad (19)$$

for k = 1, 2, ..., 216, 217, where $\theta_0$ is a permutation of the integers 1, 2, ..., 216, 217, the
parameter $\theta_1$ is a positive real number less than 1, and

$$C_{\theta_1} = \frac{1}{\sum_{k=1}^{217} \theta_1^{k} / \sqrt{k + 23}}; \qquad (20)$$

we estimate $\theta_0$ and $\theta_1$ via maximum-likelihood methods (thus obtaining $\theta_0$ by sorting the
frequencies into nonincreasing order). Please note that this model is not very carefully chosen:
the model is just a truncated geometric distribution weighted by the nonsingular function
$1/\sqrt{\theta_0(k) + 23}$, with 23 being the number of common species omitted from the collection.
More complicated models may fit better.

The significance levels calculated via 4,000,000 Monte-Carlo simulations are

• χ²: .0050

• G²: .349

• Freeman-Tukey: .951

• root-mean-square: .00002, i.e., 2E-5

As Figure 10 indicates, the discrepancy between the empirical data and the model is
substantial, and, given the large number of draws (5300), cannot be due solely to random
fluctuations. The log-likelihood-ratio (G²) and Freeman-Tukey statistics are unable to
detect this discrepancy, while the root-mean-square easily determines that the discrepancy is
very highly significant.



4.8 A modified geometric law for religious affiliations 

The Pew Forum on Religion and Public Life (a project of the Pew Research Center) recently
released [12], a report on the religious affiliations of Americans, based on a 2007 survey
of 35,556 individuals from the continental United States (the full report includes data on
Alaska and Hawaii, too, but we chose not to incorporate these). We analyze the identifications
reported in the variable "DENOM" from the publicly available data set ("DENOM"
provides the most detailed information on religious affiliations). The 35,556 randomly
selected Americans reported affiliations with 372 different religious denominations (of course, it
is unlikely that the sample included members from every denomination to which Americans
belong; there are undoubtedly more than 372 denominations). Figure 11 plots the numbers
of surveyed individuals associated with the various religious denominations when sorted in
rank order (so that the numbers are nonincreasing).

Figure 10: Numbers of specimens (the dots) from 217 species of butterflies (one bin per
species), and the best-fit distribution (the lines)

To build a model appropriate for Figure 11, we must include a permutation of the bins
as a parameter, since we have sorted the data (see Subsection 4.2 for further discussion of
sorting and permutations). Furthermore, the tail of the distribution plotted in Figure 11
seems to be more easily modeled than the full distribution. In order to focus the
goodness-of-fit test on the tail alone, we can introduce one parameter per bin outside the tail, with
the parameter being the probability of drawing the bin under the model. With such a
parameter, the model will fit the empirical data exactly in the associated bin (there will
be no discrepancy between the data and the model in that bin), so that the bin will
not contribute to any goodness-of-fit statistic, aside from altering the number of draws in
the remaining bins. To summarize, we need the following parameters: a permutation $\theta_0$
associated with sorting the data, real numbers $\theta_1$, $\theta_2$, ..., $\theta_{54}$, $\theta_{55}$ specifying the probabilities
associated with the first 55 bins in the sorted distribution, and a parameter $\theta_{56}$ associated
with the model distribution for the tail (which we choose to be a geometric distribution).
Thus, we arrive at the model

$$p_k(\theta_0, \theta_1, \dots, \theta_{55}, \theta_{56}) = \begin{cases} \theta_{\theta_0(k)}, & \theta_0(k) = 1, 2, \dots, 54, 55 \\ \left(1 - \sum_{j=1}^{55} \theta_j\right) (1 - \theta_{56}) \, \theta_{56}^{\theta_0(k) - 56}, & \theta_0(k) = 56, 57, 58, \dots \end{cases} \qquad (21)$$

for k = 1, 2, 3, ..., where $\theta_0$ is a permutation of the positive integers, and $\theta_1$, $\theta_2$, ..., $\theta_{55}$, $\theta_{56}$
are real numbers between 0 and 1. While this model may seem complicated at first glance,
the estimation of its parameters is actually very simple: first we sort the frequencies into
nonincreasing order (thus obtaining $\theta_0$), then we set $\theta_1$, $\theta_2$, ..., $\theta_{54}$, $\theta_{55}$ to be the 55 greatest
numbers of draws divided by the total number (35,556) of draws, and finally we choose $\theta_{56}$
to be the base of the geometric distribution which best fits the remaining numbers of draws
in the maximum-likelihood sense. The permutation $\theta_0$ lets us sort the data so that the
frequencies are in nonincreasing order. The parameters $\theta_1$, $\theta_2$, ..., $\theta_{54}$, $\theta_{55}$ effectively allow
us to ignore the bins with the 55 greatest numbers of draws, as our model fits those bins
exactly, by construction. The parameter $\theta_{56}$ is the base in the geometric distribution which
best fits the tail of the distribution of the data. Figure 11 plots the numbers of surveyed
individuals associated with the various religious denominations when sorted in rank order
(so that the numbers are nonincreasing), as well as the best-fit model distribution defined
in (21). Of the 35,556 total surveyed individuals, 4,050 are not associated with the 55 most
popular denominations (that is, 4,050 are not associated with the bins containing the 55
greatest numbers of surveyed individuals).
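The three estimation steps just described (sort, fit the leading bins exactly, fit a geometric tail) can be sketched as follows. The sketch assumes that the tail is an untruncated geometric distribution starting at the first bin after the exactly fitted ones, for which maximizing the likelihood gives the closed form base = A / (A + M), where M is the number of tail draws and A is the sum over tail draws of their distance from the first tail bin; this closed form is a derivation made here, not a formula quoted from the text. The function name and the toy counts are hypothetical.

```python
def fit_tail_model(counts, head=55):
    """Sort the bins, fit the `head` largest exactly, and fit a geometric base to the rest.

    Returns (order, head_probs, base): the sorting permutation, the probabilities of the
    exactly fitted bins, and the maximum-likelihood base of the geometric tail.
    """
    # Sorting into nonincreasing order plays the role of the permutation parameter.
    order = sorted(range(len(counts)), key=lambda i: -counts[i])
    ranked = [counts[i] for i in order]
    m = sum(ranked)
    head_probs = [c / m for c in ranked[:head]]
    tail = ranked[head:]
    tail_draws = sum(tail)
    # Each draw in the r-th tail bin (r = 0, 1, 2, ...) contributes r factors of the base.
    shifts = sum(r * c for r, c in enumerate(tail))
    base = shifts / (shifts + tail_draws) if tail_draws else 0.0
    return order, head_probs, base


# Toy data (not from the survey): five bins, with the two largest fitted exactly.
order, head_probs, base = fit_tail_model([1, 8, 4, 2, 16], head=2)
print(order, head_probs, base)
```

For the toy counts the tail bins hold 4, 2, and 1 draws, which halve from bin to bin, and the fitted base comes out as 4/11.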

Since the model defined in (21) involves infinitely many bins, this provides a good
opportunity to consider an example of rebinning. Instead of using (21) directly, we rebin so that
there are only n = 340 bins in all, aggregating the numbers of draws from bins 340, 341, 342,
... in the original distribution to be the number of draws for bin 340 in the rebinned
distribution. We employ the rebinning only for the calculation of the goodness-of-fit statistics; we
estimate all parameters, $\theta_0$, $\theta_1$, ..., $\theta_{55}$, $\theta_{56}$, directly from the data without rebinning, and we
generate draws from the estimated model distribution without rebinning when computing
the significance levels via Monte-Carlo simulations. (Strictly speaking, for the parameter
estimation and Monte-Carlo simulations, we rebin the infinitely many bins down to only
34,000, but for these purposes 34,000 is effectively infinite.)
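A minimal sketch of the rebinning step, under the assumption that the counts (or model probabilities) are stored in a list ordered by bin number; the function name is hypothetical. The same function applies to the empirical counts and to the model probabilities, so that both are compared on the same n bins.

```python
def rebin(values, n):
    """Merge bins n, n + 1, n + 2, ... (numbered from 1) into a single bin n."""
    if len(values) <= n:
        # Nothing to merge; pad with empty bins so that the result has n bins.
        return list(values) + [0] * (n - len(values))
    return list(values[: n - 1]) + [sum(values[n - 1 :])]


print(rebin([5, 4, 3, 2, 1], 3))
```

The total is preserved by construction, so a rebinned probability distribution still sums to one.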

The significance levels calculated via 1,000,000 Monte-Carlo simulations are then

• χ²: .460

• G²: .984

• Freeman-Tukey: .992

• root-mean-square: .0011

As Figure 11 indicates, the discrepancy between the empirical data and the model is
substantial, and, given the large number of draws, cannot be due solely to random fluctuations.
The classical statistics are unable to detect this discrepancy, while the root-mean-square
easily determines that the discrepancy is highly significant.

Figure 11: Numbers of surveyed Americans (the dots) identifying with 400 different religious
denominations (the bins), and the best-fit distribution (the lines); the fit is perfect for bins
1-55 by definition

5 The power and efficiency of the root-mean-square



In this section, we consider many numerical experiments and models, plotting the numbers 
of draws required for goodness-of-fit statistics to detect divergence from the models. We 
consider both fully specified models and parameterized models. To quantify a statistic's 
success at detecting discrepancies from the models, we use the formulation of the following 
remark. 

Remark 5.1. We say that a statistic based on given i.i.d. draws "distinguishes" the actual 
underlying distribution of the draws from the model distribution to mean that the computed 
confidence level is at least 99% for 99% of 40,000 simulations, with each simulation generating 
m i.i.d. draws according to the actual distribution. (Recall that a significance level of a is 
the same as a confidence level of 1 — a.) We computed the confidence levels by conducting 
another 40,000 simulations, with each simulation generating m i.i.d. draws according to the 
model distribution. In Appendix [XJ we use a weaker notion of "distinguish" — we say that a 
statistic based on given i.i.d. draws "distinguishes" the actual underlying distribution of the 
draws from the model distribution to mean that the computed confidence level is at least 
95% for 95% of 40,000 simulations, while running simulations and computing confidence 
levels exactly as for the plots in the present section. 

Remark 5.2. To compute the confidence levels for each example in Subsection 5.2, we should
in principle calculate the maximum-likelihood estimate $\hat{\theta}$ for each of 40,000 simulations and
(for each goodness-of-fit statistic) use these estimates to perform (40,000)² times the
three-step procedure described in Remark 3.3. The computational costs for generating the plots in
Subsection 5.2 would then be excessive. Instead, when computing the confidence levels as a
function of the value of the statistic under consideration, we calculated $\hat{\theta}$ only once, using as
the empirical data 1,000,000 draws from the underlying distribution, and (for each
goodness-of-fit statistic) performed 40,000 times the three-step procedure described in Remark 3.3,
using the single value of $\hat{\theta}$. The parameter estimates did not vary much over the 40,000
simulations, so approximating the confidence levels thus is accurate. Furthermore, when
the parameter is just a permutation, as in Subsections 5.2.8 and 5.2.9, the "approximation"
described in the present remark is exactly equivalent to recomputing the confidence levels
40,000 times; we are not making any approximation at all. Please note that we did
recalculate the maximum-likelihood estimate $\hat{\theta}$ (and $\tilde{\theta}$ from Remark 3.3) for each of 40,000
simulations when computing the values of the statistics for the simulation; however, when
calculating the confidence levels as a function of the values of the statistics, we always drew
from the model distribution associated with the same value of the parameter.

Remark 5.3. The root-mean-square statistic is not very sensitive to relative discrepancies
between the model and actual distributions in bins whose associated model probabilities are
small. When sensitivity in these bins is desirable, we recommend using both the
root-mean-square statistic and an asymptotically equivalent variation of χ², such as the
log-likelihood-ratio or "G²" test; see, for example, [17].

5.1 Examples without parameter estimation



5.1.1 A simple, illustrative example 



Let us first specify the model distribution to be

$$p_1 = \frac{1}{4}, \qquad (22)$$

$$p_2 = \frac{1}{4}, \qquad (23)$$

and

$$p_k = \frac{1}{2n - 4} \qquad (24)$$

for k = 3, 4, ..., n - 1, n. We consider m i.i.d. draws from the distribution

$$\tilde{p}_1 = \frac{3}{8}, \qquad (25)$$

$$\tilde{p}_2 = \frac{1}{8}, \qquad (26)$$

and

$$\tilde{p}_k = p_k \qquad (27)$$

for k = 3, 4, ..., n - 1, n, where $p_3$, $p_4$, ..., $p_{n-1}$, $p_n$ are the same as in (24).

Figure 12 plots the percentage of 40,000 simulations, each generating 200 i.i.d. draws
according to the actual distribution defined in (25)-(27), that are successfully detected as not
arising from the model distribution at the 1% significance level (meaning that the associated
statistic for the simulation yields a confidence level of 99% or greater). We computed the
significance levels by conducting 40,000 simulations, each generating 200 i.i.d. draws according
to the model distribution defined in (22)-(24). Figure 12 shows that the root-mean-square
is successful in at least 99% of the simulations, while the classical statistic fails often,
succeeding in less than 80% of the simulations for n = 16, and less than 5% for n > 256.

Figure 13 plots the number m of draws required to distinguish the actual distribution
defined in (25)-(27) from the model distribution defined in (22)-(24). Remark 5.1 above
specifies what we mean by "distinguish." Figure 13 shows that the root-mean-square requires
only about m = 185 draws for any number n of bins, while the classical statistic requires
90% more draws for n = 16, and greater than 300% more for n > 128. Furthermore, the
classical statistic requires increasingly many draws as the number n of bins increases,
unlike the root-mean-square.
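The experiment behind Figure 12 can be imitated at small scale with the sketch below. It is a rough illustration under several assumptions: the actual probabilities of the first two bins are taken to be 3/8 and 1/8 as read from (25) and (26), only the root-mean-square and Pearson χ² statistics are compared, the number of simulations is reduced from 40,000 to 2,000, and "confidence level of at least 99%" is approximated by exceeding the empirical 99th percentile of the statistic under the model. The function names are hypothetical.

```python
import numpy as np


def model_and_actual(n):
    """Model (22)-(24) and actual distribution (25)-(27) on n bins (values as read above)."""
    p = np.full(n, 1.0 / (2 * n - 4))
    p[0] = p[1] = 0.25
    p_actual = p.copy()
    p_actual[0], p_actual[1] = 3 / 8, 1 / 8
    return p, p_actual


def rms_stat(counts, p):
    """Root-mean-square difference between empirical and model distributions, per data set."""
    q = counts / counts.sum(axis=-1, keepdims=True)
    return np.sqrt(np.mean((q - p) ** 2, axis=-1))


def chi2_stat(counts, p):
    """Pearson chi-square statistic, per data set."""
    m = counts.sum(axis=-1, keepdims=True)
    q = counts / m
    return m[..., 0] * np.sum((q - p) ** 2 / p, axis=-1)


def detection_rate(stat, p, p_actual, m, sims, rng, level=0.99):
    """Fraction of data sets drawn from p_actual whose statistic exceeds the model's quantile."""
    under_model = stat(rng.multinomial(m, p, size=sims), p)
    under_actual = stat(rng.multinomial(m, p_actual, size=sims), p)
    return float(np.mean(under_actual > np.quantile(under_model, level)))


rng = np.random.default_rng(0)
p, p_actual = model_and_actual(64)
for name, stat in (("root-mean-square", rms_stat), ("chi-square", chi2_stat)):
    print(name, detection_rate(stat, p, p_actual, m=200, sims=2000, rng=rng))
```

With so few simulations the printed rates are noisy, but the gap between the two statistics should already be visible for n = 64 bins and m = 200 draws.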
Figure 12: First example, with m = 200 draws; see Subsection 5.1.1

Figure 14: Second example; see Subsection 5.1.2



5.1.2 Truncated power-laws 

Next, let us specify the model distribution to be

$$p_k = \frac{C_1}{k} \qquad (28)$$

for k = 1, 2, ..., n - 1, n, where

$$C_1 = \frac{1}{\sum_{k=1}^{n} 1/k}. \qquad (29)$$

We consider m i.i.d. draws from the distribution

$$\tilde{p}_k = \frac{C_2}{k^2} \qquad (30)$$

for k = 1, 2, ..., n - 1, n, where

$$C_2 = \frac{1}{\sum_{k=1}^{n} 1/k^2}. \qquad (31)$$

Figure 14 plots the number m of draws required to distinguish the actual distribution
defined in (30) and (31) from the model distribution defined in (28) and (29). Remark 5.1
above specifies what we mean by "distinguish." Figure 14 shows that the classical statistic
requires increasingly many draws as the number n of bins increases, while the
root-mean-square exhibits the opposite behavior.

Figure 15: Third example; see Subsection 5.1.3



5.1.3 Additional truncated power-laws 

Let us again specify the model distribution to be

$$p_k = \frac{C_1}{k} \qquad (32)$$

for k = 1, 2, ..., n - 1, n, where

$$C_1 = \frac{1}{\sum_{k=1}^{n} 1/k}. \qquad (33)$$

We now consider m i.i.d. draws from the distribution

$$\tilde{p}_k = \frac{C_{1/2}}{\sqrt{k}} \qquad (34)$$

for k = 1, 2, ..., n - 1, n, where

$$C_{1/2} = \frac{1}{\sum_{k=1}^{n} 1/\sqrt{k}}. \qquad (35)$$

Figure 15 plots the number m of draws required to distinguish the actual distribution
defined in (34) and (35) from the model distribution defined in (32) and (33). Remark 5.1
above specifies what we mean by "distinguish." The root-mean-square is not uniformly more
powerful than the other statistics in this example; see Remark 5.3 at the beginning of the
present section.

5.1.4 Additional truncated power-laws, reversed

Let us next specify the model distribution to be

$$p_k = \frac{C_{1/2}}{\sqrt{k}} \qquad (36)$$

for k = 1, 2, ..., n - 1, n, where

$$C_{1/2} = \frac{1}{\sum_{k=1}^{n} 1/\sqrt{k}}. \qquad (37)$$

We now consider m i.i.d. draws from the distribution

$$\tilde{p}_k = \frac{C_1}{k} \qquad (38)$$

for k = 1, 2, ..., n - 1, n, where

$$C_1 = \frac{1}{\sum_{k=1}^{n} 1/k}. \qquad (39)$$

Figure 16 plots the number m of draws required to distinguish the actual distribution
defined in (38) and (39) from the model distribution defined in (36) and (37). Remark 5.1
above specifies what we mean by "distinguish." Figure 16 shows that the classical statistic
requires many times more draws than the root-mean-square, as the number n of bins
increases.

Figure 17: Fifth example; see Subsection 5.1.5



5.1.5 A final example with fully specified truncated power-laws 

Let us next specify the model distribution to be 



for k = 1, 2, . . . , n — 1, n, where 



Co 



C2 



1 



We again consider m i.i.d. draws from the distribution 



(40) 



(41) 



for k = 1, 2, . . . , n — 1, n, where 



(42) 



(43) 



Figure 17 plots the number m of draws required to distinguish the actual distribution
defined in (42) and (43) from the model distribution defined in (40) and (41). Remark 5.1
above specifies what we mean by "distinguish." The root-mean-square is not uniformly more
powerful than the other statistics in this example; see Remark 5.3 at the beginning of the
present section.

Figure 18: Sixth example; see Subsection 5.1.6

5.1.6 Modified Poisson distributions 

Let us specify the model distribution to be the (truncated) Poisson distribution

$$p_k = B \, \frac{(3n/8)^{k-1}}{(k-1)!} \qquad (44)$$

for k = 1, 2, ..., n - 1, n, where

$$B = \frac{1}{\sum_{k=1}^{n} (3n/8)^{k-1} / (k-1)!}. \qquad (45)$$

We consider m i.i.d. draws from the distribution

$$\tilde{p}_{(3n/8)-1} = S/10, \qquad (46)$$

$$\tilde{p}_{3n/8} = 4S/5, \qquad (47)$$

$$\tilde{p}_{(3n/8)+1} = S/10, \qquad (48)$$

$$S = p_{(3n/8)-1} + p_{3n/8} + p_{(3n/8)+1}, \qquad (49)$$

and

$$\tilde{p}_k = p_k \qquad (50)$$

for the remaining values of k (for k = 1, 2, ..., 3n/8 - 3, 3n/8 - 2 and k = 3n/8 + 2, 3n/8 + 3,
..., n - 1, n), where $p_k$ is defined in (44).

Figure 18 plots the number m of draws required to distinguish the actual distribution
defined in (46)-(50) from the model distribution defined in (44) and (45). Remark 5.1 above
specifies what we mean by "distinguish."
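To make the construction concrete, the sketch below builds a truncated Poisson model with mean 3n/8 and then redistributes the combined mass S of the three bins around bin 3n/8 in the proportions 1/10, 4/5, 1/10. It assumes that n is a multiple of 8, so that 3n/8 is a bin number, and it evaluates the Poisson weights through logarithms purely to avoid overflow for large n; both choices, and the function names, belong to the sketch rather than to the text.

```python
import math


def truncated_poisson(n):
    """Truncated Poisson model with mean 3n/8 on bins k = 1, ..., n (bin k holds k - 1 events)."""
    lam = 3 * n / 8
    # log of lam^(k-1) / (k-1)!, using lgamma(k) = log((k-1)!)
    logs = [(k - 1) * math.log(lam) - math.lgamma(k) for k in range(1, n + 1)]
    peak = max(logs)
    weights = [math.exp(v - peak) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]


def modified(p):
    """Concentrate the mass S of bins 3n/8 - 1, 3n/8, 3n/8 + 1 as S/10, 4S/5, S/10."""
    n = len(p)
    if n % 8:
        raise ValueError("n must be a multiple of 8 so that 3n/8 is a bin number")
    c = 3 * n // 8 - 1  # 0-based index of bin 3n/8
    s = p[c - 1] + p[c] + p[c + 1]
    p_tilde = list(p)
    p_tilde[c - 1], p_tilde[c], p_tilde[c + 1] = s / 10, 4 * s / 5, s / 10
    return p_tilde


p = truncated_poisson(16)
p_tilde = modified(p)
print(sum(p), sum(p_tilde))
```

Because only the three central bins change and their total mass S is preserved, the modified distribution still sums to one.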
Figure 19: Seventh example; see Subsection 5.1.7



5.1.7 A truncated power-law and a truncated geometric distribution 

Let us finally specify the model distribution to be 



for = 1, 2, . . . , 



99, 100, where 



We consider m i.i.d. draws from the (truncated) geometric distribution 



Pk = Ctt 



for = 1, 2, 



99, 100, where 



Ct 



(51) 
(52) 

(53) 
(54) 



Figure 19 considers several values for t.

Figure 19 plots the number m of draws required to distinguish the actual distribution
defined in (53) and (54) from the model distribution defined in (51) and (52). Remark 5.1
above specifies what we mean by "distinguish." See the next section, Subsection 5.2.1, for a
similar example, this time involving parameter estimation.

Figure 20: First example; see Subsection 5.2.1



5.2 Examples with parameter estimation 

5.2.1 A truncated power-law and a truncated geometric distribution 

We turn now to models involving parameter estimation (for details, see [H]). Let us specify
the model distribution to be the Zipf distribution

$$p_k(\theta) = \frac{C_\theta}{k^\theta} \qquad (55)$$

for k = 1, 2, ..., 99, 100, where

$$C_\theta = \frac{1}{\sum_{k=1}^{100} k^{-\theta}}; \qquad (56)$$

we estimate the parameter $\theta$ via maximum-likelihood methods. We consider m i.i.d. draws
from the (truncated) geometric distribution

$$\tilde{p}_k = c_t \, t^k \qquad (57)$$

for k = 1, 2, ..., 99, 100, where

$$c_t = \frac{1}{\sum_{k=1}^{100} t^k}. \qquad (58)$$

Figure 20 considers several values for t.

Figure 20 plots the number m of draws required to distinguish the actual distribution
defined in (57) and (58) from the model distribution defined in (55) and (56), estimating the
parameter $\theta$ in (55) and (56) via maximum-likelihood methods. Remark 5.1 above specifies
what we mean by "distinguish."
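One way to carry out the maximum-likelihood fit of a truncated Zipf model is sketched below, assuming the model has the form p_k(θ) proportional to k^(-θ) on bins 1 to n. Setting the derivative of the log-likelihood to zero shows that the estimate is the value of θ at which the model's expected value of ln k equals the empirical average of ln k; since that expectation decreases as θ grows, bisection suffices. The search interval [0, 20], the iteration count, and the function names are arbitrary choices of the sketch, and the text does not say which optimizer the authors used.

```python
import math


def zipf(theta, n=100):
    """Truncated Zipf probabilities proportional to k^(-theta) for k = 1, ..., n."""
    weights = [k ** -theta for k in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def fit_zipf(counts, lo=0.0, hi=20.0, iterations=100):
    """Maximum-likelihood exponent for counts in bins 1, ..., len(counts), by bisection."""
    n, m = len(counts), sum(counts)
    target = sum(c * math.log(k) for k, c in enumerate(counts, start=1)) / m

    def mean_log(theta):
        return sum(pk * math.log(k) for k, pk in enumerate(zipf(theta, n), start=1))

    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mean_log(mid) > target:
            lo = mid  # model puts too much mass on large k, so the exponent must grow
        else:
            hi = mid
    return (lo + hi) / 2


# Counts exactly proportional to a Zipf distribution with exponent 1 recover that exponent.
print(fit_zipf([1e6 * pk for pk in zipf(1.0)]))
```

If the data are flatter than the uniform distribution in the sense of ln k, the estimate simply lands on the lower end of the search interval, which would need widening in that case.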
Figure 21: Second example; see Subsection 5.2.2



5.2.2 A rebinned geometric distribution and a truncated power-law 

Let us specify the model distribution to be

$$p_k(\theta) = \theta^{k-1} (1 - \theta) \qquad (59)$$

for k = 1, 2, ..., 98, 99, and

$$p_{100}(\theta) = \theta^{99}; \qquad (60)$$

we estimate the parameter $\theta$ via maximum-likelihood methods. We consider m i.i.d. draws
from the Zipf distribution

$$\tilde{p}_k = \frac{C_t}{k^t} \qquad (61)$$

for k = 1, 2, ..., 99, 100, where

$$C_t = \frac{1}{\sum_{k=1}^{100} k^{-t}}. \qquad (62)$$

Figure 21 considers several values for t.

Figure 21 plots the number m of draws required to distinguish the actual distribution
defined in (61) and (62) from the model distribution defined in (59) and (60), estimating the
parameter $\theta$ in (59) and (60) via maximum-likelihood methods. Remark 5.1 above specifies
what we mean by "distinguish."
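For the rebinned geometric model, the maximum-likelihood estimate has a closed form, which the sketch below implements. Writing n_k for the number of draws in bin k, the log-likelihood is A ln θ + B ln(1 - θ), with A the sum of (k - 1) n_k over all 100 bins and B the number of draws outside bin 100, so the maximizer is θ = A / (A + B). This is a derivation made here from (59) and (60), not a formula quoted from the text, and the function names and toy counts are hypothetical.

```python
import math


def exponents(counts):
    """Return (A, B): total power of theta and total power of (1 - theta) in the likelihood."""
    a = sum(j * c for j, c in enumerate(counts))  # j = k - 1 for bin k; the last bin has j = n - 1
    b = sum(counts[:-1])  # every draw outside the last (aggregated) bin contributes (1 - theta)
    return a, b


def fit_rebinned_geometric(counts):
    """Closed-form maximum-likelihood estimate of theta for the model (59)-(60)."""
    a, b = exponents(counts)
    return a / (a + b)


def log_likelihood(theta, counts):
    a, b = exponents(counts)
    return a * math.log(theta) + b * math.log(1 - theta)


# Toy counts over 100 bins (not from the paper): most draws in the first bins, a few in bin 100.
counts = [0] * 100
counts[0], counts[1], counts[2], counts[99] = 60, 25, 10, 5
theta_hat = fit_rebinned_geometric(counts)
print(theta_hat, log_likelihood(theta_hat, counts))
```

The log-likelihood is concave in θ, so the stationary point is the unique maximum whenever both A and B are positive.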
Figure 22: Third example; see Subsection 5.2.3



5.2.3 Truncated shifted Poisson distributions 

Let us specify the model distribution to be the (truncated) Poisson distribution 

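$$p_k(\theta) = B_\theta \, \frac{\theta^{k-1}}{(k-1)!} \tag{63}$$

for $k = 1, 2, \dots, 20, 21$, where

$$B_\theta = \frac{1}{\sum_{k=1}^{21} \theta^{k-1}/(k-1)!}; \tag{64}$$

we estimate the parameter $\theta$ via maximum-likelihood methods. We consider m i.i.d. draws
from the distribution

$$p_k = B_t \, \frac{5^{k-1+t}}{(k-1+t)!} \tag{65}$$

for $k = 1, 2, \dots, 20, 21$, where

$$B_t = \frac{1}{\sum_{k=1}^{21} 5^{k-1+t}/(k-1+t)!}. \tag{66}$$

Figure 22 considers several values for t. Clearly, $p_k = p_k(5)$ for $k = 1, 2, \dots, 20, 21$, if t = 0.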

Figure 22 plots the number m of draws required to distinguish the actual distribution
defined in (65) and (66) from the model distribution defined in (63) and (64), estimating the
parameter $\theta$ in (63) and (64) via maximum-likelihood methods. Remark 5.1 above specifies
what we mean by "distinguish."
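
The two distributions of this example are easy to tabulate, as the sketch below shows. It assumes the model $p_k(\theta) \propto \theta^{k-1}/(k-1)!$ and the shifted distribution $p_k \propto 5^{k-1+t}/(k-1+t)!$ on 21 bins, and it interprets the factorial of a non-integer argument through the gamma function, $x! = \Gamma(x+1)$; that interpretation, like the function names, is our assumption.

```python
import math

def truncated_poisson_pmf(theta, n=21):
    """Model: p_k(theta) proportional to theta**(k-1) / (k-1)! for k = 1..n."""
    weights = [theta ** (k - 1) / math.factorial(k - 1) for k in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def shifted_pmf(t, n=21, rate=5.0):
    """Draws: p_k proportional to rate**(k-1+t) / (k-1+t)!, using x! = Gamma(x+1).

    Requires t > -1 so that the gamma function is evaluated at positive arguments.
    """
    weights = [rate ** (k - 1 + t) / math.gamma(k + t) for k in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# With t = 0 the shifted distribution coincides with the model at theta = 5.
model = truncated_poisson_pmf(5.0)
actual = shifted_pmf(0.0)
max_gap = max(abs(a - b) for a, b in zip(model, actual))
```

Here `max_gap` vanishes up to floating-point round-off, in agreement with the remark that the two distributions coincide when t = 0.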
Figure 23: Fourth example; see Subsection 5.2.4

5.2.4 Modified Poisson distributions

Let us specify the model distribution to be the (truncated) Poisson distribution 



$$p_k(\theta) = B_\theta \, \frac{\theta^{k-1}}{(k-1)!} \tag{67}$$

for $k = 1, 2, \dots, n-1, n$, where

$$B_\theta = \frac{1}{\sum_{k=1}^{n} \theta^{k-1}/(k-1)!}; \tag{68}$$

we estimate the parameter $\theta$ via maximum-likelihood methods. We consider m i.i.d. draws
from the distribution

$$p_{(3n/8)-1} = S/10, \tag{69}$$

$$p_{3n/8} = 4S/5, \tag{70}$$

$$p_{(3n/8)+1} = S/10, \tag{71}$$

$$S = p_{(3n/8)-1}(3n/8) + p_{3n/8}(3n/8) + p_{(3n/8)+1}(3n/8), \tag{72}$$

and

$$p_k = p_k(3n/8) \tag{73}$$

for the remaining values of k (for $k = 1, 2, \dots, \frac{3n}{8} - 3, \frac{3n}{8} - 2$ and
$k = \frac{3n}{8} + 2, \frac{3n}{8} + 3, \dots, n-1, n$), where $p_k(\theta)$ is defined in (67).

Figure 23 plots the number m of draws required to distinguish the actual distribution
defined in (69)-(73) from the model distribution defined in (67) and (68), estimating the
parameter $\theta$ in (67) and (68) via maximum-likelihood methods. Remark 5.1 above specifies
what we mean by "distinguish."
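
The construction of the modified distribution can be sketched as follows. The sketch assumes that n is a multiple of 8 (so that 3n/8 is a bin number) and that the three central bins receive the shares 1/10, 4/5, and 1/10 of their combined model probability S; the weights are computed in log-space to avoid overflow for large n. The function names are hypothetical.

```python
import math

def truncated_poisson_pmf(theta, n):
    """Model: p_k(theta) proportional to theta**(k-1) / (k-1)! for k = 1..n."""
    # Work with logarithms, since theta**(k-1) overflows for large n.
    logs = [(k - 1) * math.log(theta) - math.lgamma(k) for k in range(1, n + 1)]
    peak = max(logs)
    weights = [math.exp(v - peak) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def modified_poisson_pmf(n):
    """Model at theta = 3n/8 with the mass of bins 3n/8 - 1, 3n/8, 3n/8 + 1 reshaped."""
    if n % 8 != 0:
        raise ValueError("this sketch assumes that n is a multiple of 8")
    center = 3 * n // 8                    # 1-based number of the central bin
    pmf = truncated_poisson_pmf(3 * n / 8, n)
    # Combined model probability S of the three central bins (0-based indices).
    s = pmf[center - 2] + pmf[center - 1] + pmf[center]
    pmf[center - 2] = s / 10
    pmf[center - 1] = 4 * s / 5
    pmf[center] = s / 10
    return pmf
```

Since the three new values again sum to S, the result remains a probability distribution and agrees with the model at $\theta = 3n/8$ in every other bin.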
Figure 24: Fifth example; see Subsection 5.2.5



5.2.5 An example with a uniform tail 

Let us specify the model distribution to be 

PiiO) = 9, (74) 



and 



P2(9) = i - 9, (75) 



pdO) = ^ (76) 



for $k = 3, 4, \dots, n-1, n$; we estimate the parameter $\theta$ via maximum-likelihood methods.
We consider m i.i.d. draws from the distribution

Pi = i (77) 



and 



P-2 = I, (76 



for $k = 3, 4, \dots, n-1, n$.

Figure 24 plots the number m of draws required to distinguish the actual distribution
defined in (77)-(79) from the model distribution defined in (74)-(76), estimating the
parameter $\theta$ in (74)-(76) via maximum-likelihood methods. Remark 5.1 above specifies what we
mean by "distinguish."
Figure 25: Sixth example; see Subsection 5.2.6

5.2.6 Another example with a uniform tail

Let us specify the model distribution to be 

$$p_1(\theta) = \theta, \tag{80}$$

$$p_2(\theta) = \theta, \tag{81}$$

$$p_3(\theta) = \frac{1}{2} - 2\theta, \tag{82}$$

and

$$p_k(\theta) = \frac{1}{2n-6} \tag{83}$$

for $k = 4, 5, \dots, n-1, n$; we estimate the parameter $\theta$ via maximum-likelihood methods.
We consider m i.i.d. draws from the distribution

$$p_1 = \frac{1}{4}, \tag{84}$$

$$p_2 = \frac{1}{8}, \tag{85}$$

$$p_3 = \frac{1}{8}, \tag{86}$$

and

$$p_k = \frac{1}{2n-6} \tag{87}$$

for $k = 4, 5, \dots, n-1, n$.

Figure 25 plots the number m of draws required to distinguish the actual distribution
defined in (84)-(87) from the model distribution defined in (80)-(83), estimating the
parameter $\theta$ in (80)-(83) via maximum-likelihood methods. Remark 5.1 above specifies what we
mean by "distinguish."
Figure 26: Seventh example; see Subsection 5.2.7



5.2.7 A model with an integer-valued parameter 

Let us specify the model distribution to be 



PkiO) = ^ (88) 



for $k = 1, 2, \dots, \theta - 1, \theta$, and



MO) = (89) 

for $k = \theta + 1, \theta + 2, \dots, n-1, n$; we estimate the parameter $\theta$ via maximum-likelihood
methods. We consider m i.i.d. draws from the distribution

Pi = \, (90) 
P2 = I, (91) 
P3 = I, (92) 

- ^2 ^''^ 
for k = 4, 5, . . . , n — 1, n. 

Figure 26 plots the number m of draws required to distinguish the actual distribution
defined in (90)-(93) from the model distribution defined in (88) and (89), estimating the
parameter $\theta$ in (88) and (89) via maximum-likelihood methods. Remark 5.1 above specifies
what we mean by "distinguish."



5.2.8 Truncated power-laws parameterized with a permutation

Let us specify the model to be the Zipf distribution 

= ^ (94) 
for $k = 1, 2, \dots, n-1, n$, where $\theta$ is a permutation of the integers $1, 2, \dots, n-1, n$, and

we estimate the permutation $\theta$ via maximum-likelihood methods, that is, by sorting the
frequencies: first we choose $k_1$ to be the number of a bin containing the greatest number of
draws among all n bins, then we choose $k_2$ to be the number of a bin containing the greatest
number of draws among the remaining n - 1 bins, then we choose $k_3$ to be the number of a
bin containing the greatest number of draws among the remaining n - 2 bins, and so on, and
finally we find $\theta$ such that $\theta(k_1) = 1$, $\theta(k_2) = 2$, ..., $\theta(k_{n-1}) = n - 1$, $\theta(k_n) = n$.
We consider m i.i.d. draws from the distribution 



for k = 1, 2, . . . , n — 1, n, where 



= § (96) 



Figure 27 plots the number m of draws required to distinguish the actual distribution
defined in (96) and (97) from the model distribution defined in (94) and (95), estimating
the parameter $\theta$ in (94) via maximum-likelihood methods (that is, by sorting). Remark 5.1
above specifies what we mean by "distinguish."
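
Estimating a permutation by sorting the frequencies can be sketched in a few lines. The sketch below assumes a family of the form $p_k(\theta) = r_{\theta(k)}$, where r is a fixed distribution listed in nonincreasing order; ties between bins with equal counts are broken by bin number, which is an arbitrary choice on our part. The function names are hypothetical.

```python
def fit_permutation(counts):
    """Return theta as a list in which theta[k - 1] is the rank theta(k) of bin k."""
    n = len(counts)
    # Bin numbers k_1, k_2, ..., k_n in order of decreasing count (stable for ties).
    order = sorted(range(1, n + 1), key=lambda k: -counts[k - 1])
    theta = [0] * n
    for rank, k in enumerate(order, start=1):
        theta[k - 1] = rank  # theta(k_rank) = rank
    return theta

def permuted_pmf(r, theta):
    """Model distribution p_k(theta) = r[theta(k)], with r indexed from rank 1."""
    return [r[rank - 1] for rank in theta]

# Four bins with counts 3, 10, 0, 7: bin 2 ranks first, then bins 4, 1, and 3.
theta_hat = fit_permutation([3, 10, 0, 7])
```

In this example `theta_hat` is `[3, 1, 4, 2]`, so the fitted model assigns the largest probability in r to bin 2 and the smallest to bin 3.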
Figure 28: Ninth example; see Subsection 5.2.9

5.2.9 Another model parameterized with a permutation

Let us specify the model distribution to be 

$$p_k(\theta) = \begin{cases} 3/8, & \theta(k) = 1 \\ 1/8, & \theta(k) = 2 \\ 1/(2n-4), & \theta(k) = 3, 4, \dots, n-1, \text{ or } n \end{cases} \tag{98}$$

for $k = 1, 2, \dots, n-1, n$, where $\theta$ is a permutation of the integers $1, 2, \dots, n-1, n$;
we estimate the permutation $\theta$ via maximum-likelihood methods, that is, by sorting the
frequencies: first we choose $k_1$ to be the number of a bin containing the greatest number of
draws among all n bins, then we choose $k_2$ to be the number of a bin containing the greatest
number of draws among the remaining n - 1 bins, then we choose $k_3$ to be the number of a
bin containing the greatest number of draws among the remaining n - 2 bins, and so on, and
finally we find $\theta$ such that $\theta(k_1) = 1$, $\theta(k_2) = 2$, ..., $\theta(k_{n-1}) = n - 1$, $\theta(k_n) = n$.
We consider m i.i.d. draws from the distribution 

$$p_1 = 1/4, \tag{99}$$

$$p_2 = 1/4, \tag{100}$$

and

$$p_k = 1/(2n-4) \tag{101}$$

for $k = 3, 4, \dots, n-1, n$.

Figure 28 plots the number m of draws required to distinguish the actual distribution
defined in (99)-(101) from the model distribution defined in (98), estimating the parameter
$\theta$ in (98) via maximum-likelihood methods (that is, by sorting). Remark 5.1 above specifies
what we mean by "distinguish."
Figure 29: Tenth example; see Subsection 5.2.10



5.2.10 A model with two parameters 

For the final example, let us specify the model distribution to be 

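$$p_1(\theta_1, \theta_2) = \theta_1, \tag{102}$$

$$p_2(\theta_1, \theta_2) = \theta_1, \tag{103}$$

$$p_3(\theta_1, \theta_2) = \theta_2, \tag{104}$$

$$p_4(\theta_1, \theta_2) = \theta_2, \tag{105}$$

and

$$p_k(\theta_1, \theta_2) = \frac{1 - 2\theta_1 - 2\theta_2}{n - 4} \tag{106}$$

for $k = 5, 6, \dots, n-1, n$; we estimate the parameters $\theta_1$ and $\theta_2$ via maximum-likelihood
methods. We consider m i.i.d. draws from the distribution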

P. - |, (107) 
P2 = 4' (108) 



P3 = ^, (109) 

P4 = 7^, (110) 
and



for $k = 5, 6, \dots, n-1, n$.

Figure 29 plots the number m of draws required to distinguish the actual distribution
defined in (107)-(111) from the model distribution defined in (102)-(106), estimating the
parameters $\theta_1$ and $\theta_2$ in (102)-(106) via maximum-likelihood methods. Remark 5.1 above
specifies what we mean by "distinguish."
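
For a two-parameter model of this kind, the maximum-likelihood estimates are available in closed form. The sketch below assumes the model $p_1 = p_2 = \theta_1$, $p_3 = p_4 = \theta_2$, and $p_k = (1 - 2\theta_1 - 2\theta_2)/(n-4)$ for the remaining bins; this specific form is our reading of the model and should be checked against the definitions above. Under that assumption the likelihood depends on the data only through the totals in bins 1-2, bins 3-4, and the tail, so the estimates are half the observed frequencies of the first two groups. The function names are hypothetical.

```python
def two_parameter_pmf(theta1, theta2, n):
    """Assumed model: bins 1-2 have probability theta1, bins 3-4 theta2, tail uniform."""
    tail = (1 - 2 * theta1 - 2 * theta2) / (n - 4)
    return [theta1, theta1, theta2, theta2] + [tail] * (n - 4)

def fit_two_parameters(counts):
    """Closed-form maximum-likelihood estimates (theta1, theta2) from the bin counts."""
    m = sum(counts)
    if m == 0:
        raise ValueError("no draws: the estimates are undefined")
    # 2 * theta1 is the probability of landing in bins 1-2; similarly for theta2.
    theta1 = (counts[0] + counts[1]) / (2 * m)
    theta2 = (counts[2] + counts[3]) / (2 * m)
    return theta1, theta2
```

For instance, with 40 draws of which 18 fall in bins 1-2 and 12 fall in bins 3-4, the estimates are 18/80 = 9/40 and 12/80 = 3/20.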
A Additional plots of power and efficiency

For each plot in Section 5, this appendix provides a corresponding plot based on a confidence
level of 95% (that is, a significance level of 5%), rather than a confidence level of 99% (that
is, a significance level of 1%). In this appendix, Figures 31-47 set the probabilities of false
positives and false negatives both to be 5% in order to determine the required number m of
draws, whereas in Section 5 above Figures 13-29 set the probabilities of false positives and
false negatives both to be 1% (see Remark 5.1). Similarly, a rejection is deemed successful
for Figure 30 at the 5% significance level (or better), whereas a rejection is deemed successful
for Figure 12 only at the stricter 1% significance level (or better).
Figure 31: First example (statistical "efficiency"); see Subsection 5.1.1

Figure 32: Second example; see Subsection 5.1.2

[Plot for Figure 33; horizontal axis: number (n) of bins]

Figure 33: Third example; see Subsection 5.1.3

Figure 34: Fourth example; see Subsection 5.1.4

Figure 35: Fifth example; see Subsection 5.1.5

Figure 36: Sixth example; see Subsection 5.1.6

Figure 37: Seventh example; see Subsection 5.1.7

Figure 38: First example; see Subsection 5.2.1




Figure 39: Second example; see Subsection 5.2.2

Figure 40: Third example; see Subsection 5.2.3

Figure 41: Fourth example; see Subsection 5.2.4

Figure 42: Fifth example; see Subsection 5.2.5



[Plot for Figure 43; horizontal axis: number (n) of bins; legend entries include
Freeman-Tukey and root-mean-square]

Figure 43: Sixth example; see Subsection 5.2.6

Figure 44: Seventh example; see Subsection 5.2.7

Figure 45: Eighth example; see Subsection 5.2.8

Figure 46: Ninth example; see Subsection 5.2.9




Figure 47: Tenth example; see Subsection 5.2.10






B Convergence to asymptotic levels 



In this appendix, we investigate the convergence rates of significance levels to their
asymptotic values in the limit of large numbers of draws. We take all draws directly from the
model distributions, and focus on models with real-valued parameters. Needless to say,
the model parameters are almost surely known exactly in the limit of large numbers of
draws. Maximum-likelihood estimates of the parameters converge to the actual values
relatively fast in all examples considered below; the significance levels for the
root-mean-square converge as fast as or faster than those for the classical statistics from
the Cressie-Read power-divergence family (the classical statistics are $\chi^2$, the
log-likelihood-ratio $G^2$, and the Freeman-Tukey/Hellinger distance).

Figures 48-51 plot the exact significance level versus the level in the limit of large numbers
of draws (we computed the asymptotic levels via the method detailed in [18]). The exact
significance level is the estimate obtained via 800,000 Monte-Carlo simulations, with
each simulation generating m draws according to the model distribution for the values of the
parameters specified below. The significance levels obtained via simulations include error
bars whose heights (top to bottom) are about twice the standard errors of the estimated
levels; we used Remark 3.4 to estimate the standard errors. Please note that, as the number
m of draws increases, the plotted traces converge to the straight line through the origin of
unit slope (as they should).
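
To give a sense of the size of these error bars, the following worked estimate assumes the usual binomial standard error for a proportion estimated from N independent simulations; the symbol N and the choice of this particular formula are our assumptions, and the details of the remark cited above may differ.

```latex
\[
\mathrm{SE}(\hat\alpha) \approx \sqrt{\frac{\hat\alpha \, (1 - \hat\alpha)}{N}},
\qquad
N = 800{,}000, \quad \hat\alpha = 0.01
\;\Longrightarrow\;
\mathrm{SE}(\hat\alpha) \approx \sqrt{\frac{0.01 \times 0.99}{800{,}000}}
\approx 1.1 \times 10^{-4}.
\]
```

Under this assumption, an error bar of about twice the standard error has a height of roughly $2.2 \times 10^{-4}$ at the 1% level, which is small relative to the level itself.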

Figure 48 plots the exact significance level (estimated via simulation) versus the level in
the limit of large numbers of draws, for the model distribution

$$p_k(\theta) = \theta^{k-1} (1 - \theta) \tag{112}$$

for $k = 1, 2, \dots, 8, 9$, and

$$p_{10}(\theta) = \theta^9; \tag{113}$$

we estimate the parameter $\theta$ for the goodness-of-fit statistics via maximum-likelihood
methods, using $\theta = 7/10$ in the generation of the m i.i.d. draws for each of the 800,000
simulations. Of the four statistics considered, the root-mean-square clearly converges the
fastest, as the number m of draws increases.

Figure 49 plots the exact significance level (estimated via simulation) versus the level in
the limit of large numbers of draws, taking the model to be the Zipf distribution

$$p_k(\theta) = \frac{C_\theta}{k^\theta} \tag{114}$$

for $k = 1, 2, \dots, 9, 10$, where

$$C_\theta = \frac{1}{\sum_{k=1}^{10} k^{-\theta}}; \tag{115}$$

we estimate the parameter $\theta$ for the goodness-of-fit statistics via maximum-likelihood
methods, using $\theta = 7/2$ in the generation of the m i.i.d. draws for each of the 800,000
simulations. Of the four statistics considered, the root-mean-square converges by far the fastest.

Figure 50 plots the exact significance level (estimated via simulation) versus the level in
the limit of large numbers of draws, taking the model to be the Zipf distribution

$$p_k(\theta) = \frac{C_\theta}{k^\theta} \tag{116}$$

for $k = 1, 2, \dots, 99, 100$, where

$$C_\theta = \frac{1}{\sum_{k=1}^{100} k^{-\theta}}; \tag{117}$$

we estimate the parameter $\theta$ for the goodness-of-fit statistics via maximum-likelihood
methods, using $\theta = 5/2$ in the generation of the m i.i.d. draws for each of the 800,000
simulations. Of the four statistics considered, the root-mean-square converges by far the
fastest, as the number m of draws increases.

Figure 51 plots the exact significance level (estimated via simulation) versus the level in
the limit of large numbers of draws, for the two-parameter model distribution



for $k = 1, 2, \dots, 19, 20$; we estimate the parameters $\theta_1$ and $\theta_2$ for the goodness-of-fit
statistics via maximum-likelihood methods, using $\theta_1 = 9/40$ and $\theta_2 = 3/20$ in the generation
of the m i.i.d. draws for each of the 800,000 simulations. The root-mean-square and $\chi^2$
statistics behave similarly, converging faster than the log-likelihood-ratio $G^2$ and
Freeman-Tukey statistics, as the number m of draws increases.

Remark B.l. It is possible to accelerate the convergence via higher-order asymptotics. 
Presumably such acceleration is possible for all four statistics considered in this appendix. 

Remark B.2. For any family $p(\theta)$ of discrete probability distributions parameterized by a
permutation $\theta$ that specifies the order of the bins (meaning that there exists a discrete
probability distribution r such that $p_k(\theta) = r_{\theta(k)}$ for all k), and for any number m of
draws, the confidence levels defined in Remark 3.3 have the following highly desirable
property: Suppose that the actual underlying distribution p of the experimental draws is equal
to $p(\theta)$ for some (unknown) $\theta$. Suppose further that $\gamma$ is the confidence level for rejecting
(7), calculated for a particular realization of the experiment (the associated significance
level is $\alpha = 1 - \gamma$). Consider repeating the same experiment over and over, and calculating
the confidence level for each realization, each time using that realization's particular
maximum-likelihood estimate of the parameter in the hypothesis (7). Then, the fraction of the
confidence levels that are less than $\gamma$ is equal to $\gamma$ in the limit of many repetitions of
the experiment. This property
is a compelling reason to use d{Q,p{Q)) rather than d{Q,p{9)) in the left-hand side of Q. 
Also, the procedure of Remark 3.3 can be viewed as a parametric bootstrap approximation
(see, for example, [2] and [6]).

In addition, for any family $p(\theta)$ of discrete probability distributions, the confidence levels
defined in Remark 3.3 have the following highly desirable property: Suppose that the actual
underlying distribution p of the experimental draws is equal to $p(\theta)$ for some (unknown) $\theta$.
Consider repeating the experiment over and over, and calculating the confidence level for
each realization, each time using that realization's particular maximum-likelihood estimate
of the parameter in the hypothesis (7). Then, the resulting confidence levels converge in
distribution to the uniform distribution over (0, 1) in the limit of large numbers of draws.

It may be somewhat fortuitous that the scheme in Remark 3.3 has so many favorable
properties; see, for example, [1] and [18].
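
The parametric-bootstrap viewpoint mentioned in this remark can be illustrated with a short simulation. The sketch below is one possible reading of such a scheme, not the authors' code: it assumes the rebinned geometric model $p_k(\theta) = \theta^{k-1}(1-\theta)$ for $k < n$ with $p_n(\theta) = \theta^{n-1}$, uses the root-mean-square distance between the empirical and fitted distributions as the statistic, and re-estimates the parameter for every simulated data set. The function names, the number of simulations, and the seed are our own choices.

```python
import math
import random

def model_pmf(theta, n):
    """Rebinned geometric model on n bins."""
    pmf = [theta ** (k - 1) * (1 - theta) for k in range(1, n)]
    pmf.append(theta ** (n - 1))
    return pmf

def fit_theta(counts):
    """Closed-form maximum-likelihood estimate for the rebinned geometric model."""
    a = sum(j * c for j, c in enumerate(counts))  # sum of (k - 1) * count_k
    b = sum(counts[:-1])                          # draws outside the last bin
    return a / (a + b)

def rms_distance(counts, pmf):
    """Root-mean-square difference between empirical and model distributions."""
    m = sum(counts)
    return math.sqrt(sum((c / m - p) ** 2 for c, p in zip(counts, pmf)) / len(pmf))

def sample_counts(pmf, m, rng):
    """Bin counts of m i.i.d. draws from pmf."""
    counts = [0] * len(pmf)
    for k in rng.choices(range(len(pmf)), weights=pmf, k=m):
        counts[k] += 1
    return counts

def confidence_level(counts, num_sims=1000, seed=0):
    """Fraction of simulated statistics that fall below the observed statistic."""
    rng = random.Random(seed)
    n, m = len(counts), sum(counts)
    fitted = model_pmf(fit_theta(counts), n)
    observed = rms_distance(counts, fitted)
    below = 0
    for _ in range(num_sims):
        sim = sample_counts(fitted, m, rng)
        # Re-estimate the parameter for each simulated data set before comparing.
        if rms_distance(sim, model_pmf(fit_theta(sim), n)) < observed:
            below += 1
    return below / num_sims
```

A data set far from every member of the family, such as the counts `[100, 0, 100, 0, 0]`, yields a confidence level close to 1, whereas counts that nearly match a fitted model yield a small value. Re-estimating the parameter inside the loop is the step that distinguishes this scheme from comparing against a single fixed distribution.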
Figure 48: Convergence for a (rebinned) geometric distribution with n = 10 bins

Figure 49: Convergence for a Zipf distribution with n = 10 bins

Figure 50: Convergence for a Zipf distribution with n = 100 bins

Figure 51: Convergence for a two-parameter model with n = 20 bins